
Showing papers in "Statistics and Computing in 1995"


Journal ArticleDOI
TL;DR: In this paper, the authors provide simulation algorithms for one-sided and two-sided truncated normal distributions, which are then used to simulate multivariate normal variables with convex restricted parameter space for any covariance structure.
Abstract: We provide simulation algorithms for one-sided and two-sided truncated normal distributions. These algorithms are then used to simulate multivariate normal variables with convex restricted parameter space for any covariance structure.
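A minimal sketch of the kind of rejection algorithm involved is given below, assuming a standard normal truncated to [a, ∞): for a non-positive truncation point, plain rejection from N(0, 1) suffices, while for a positive truncation point a translated exponential proposal with a suitably chosen rate is used. The function name and structure are our own; this illustrates the general construction rather than reproducing the paper's algorithms.

```python
import numpy as np

def rtnorm_lower(a, size=1, rng=None):
    """Draw from a standard N(0, 1) truncated to [a, inf).
    Sketch of the translated-exponential rejection idea for a > 0."""
    rng = np.random.default_rng() if rng is None else rng
    lam = (a + np.sqrt(a * a + 4.0)) / 2.0    # a common choice of exponential rate
    out = np.empty(size)
    for i in range(size):
        while True:
            if a <= 0:                        # easy case: plain rejection from N(0, 1)
                x = rng.standard_normal()
                if x >= a:
                    break
            else:                             # proposal x = a + Exp(lam)
                x = a + rng.exponential(1.0 / lam)
                if rng.random() <= np.exp(-0.5 * (x - lam) ** 2):
                    break
        out[i] = x
    return out
```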

438 citations


Journal ArticleDOI
TL;DR: This paper reviews recent literature on techniques for obtaining random samples from databases, and describes sampling for estimation of aggregates (e.g. the size of query results).
Abstract: This paper reviews recent literature on techniques for obtaining random samples from databases. We begin with a discussion of why one would want to include sampling facilities in database management systems. We then review basic sampling techniques used in constructing DBMS sampling algorithms, e.g. acceptance/rejection and reservoir sampling. A discussion of sampling from various data structures follows: B+-trees, hash files, and spatial data structures (including R-trees and quadtrees). Algorithms for sampling from simple relational queries, e.g. single relational operators such as selection, intersection, union, set difference, projection, and join, are then described. We then describe sampling for estimation of aggregates (e.g. the size of query results). Here we discuss both clustered sampling and sequential sampling approaches. Decision-theoretic approaches to sampling for query optimization are reviewed.
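Reservoir sampling, one of the basic techniques reviewed above, is compact enough to sketch directly; the version below is the classic 'Algorithm R' for drawing a uniform sample of k items from a stream of unknown length, offered purely as an illustration.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Keep a uniform random sample of k items from a stream (Algorithm R)."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)            # fill the reservoir first
        else:
            j = rng.randint(0, i)             # uniform index in 0..i inclusive
            if j < k:
                reservoir[j] = item           # replace with probability k/(i+1)
    return reservoir
```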

161 citations


Journal ArticleDOI
TL;DR: This work investigates the implementation of univariate PLS algorithms within FORTRAN and the Matlab and Splus environments, and investigates the merits of using the orthogonal invariance of PLS regression to ‘improve’ the algorithms.
Abstract: Partial least squares (PLS) regression has been proposed as an alternative regression technique to more traditional approaches such as principal components regression and ridge regression. A number of algorithms have appeared in the literature which have been shown to be equivalent. Someone wishing to implement PLS regression in a programming language or within a statistical package must choose which algorithm to use. We investigate the implementation of univariate PLS algorithms within FORTRAN and the Matlab (1993) and Splus (1992) environments, comparing theoretical measures of execution speed based on flop counts with their observed execution times. We also comment on the ease with which the algorithms may be implemented in the different environments. Finally, we investigate the merits of using the orthogonal invariance of PLS regression to ‘improve’ the algorithms.
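For orientation, a generic univariate PLS (PLS1) routine of the kind being compared is sketched below in Python with NumPy; it is a plain NIPALS-style implementation, not one of the specific FORTRAN, Matlab or Splus versions timed in the paper.

```python
import numpy as np

def pls1(X, y, n_components):
    """Univariate PLS regression (PLS1), NIPALS-style sketch.
    Returns regression coefficients for centred X and y."""
    X = X - X.mean(axis=0)
    y = y - y.mean()
    p = X.shape[1]
    W = np.zeros((p, n_components))   # weight vectors
    P = np.zeros((p, n_components))   # X loadings
    q = np.zeros(n_components)        # y loadings
    Xk, yk = X.copy(), y.copy()
    for a in range(n_components):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)
        t = Xk @ w
        tt = t @ t
        P[:, a] = Xk.T @ t / tt
        q[a] = yk @ t / tt
        Xk -= np.outer(t, P[:, a])    # deflate X and y
        yk -= q[a] * t
        W[:, a] = w
    return W @ np.linalg.solve(P.T @ W, q)   # coefficients in the original X space
```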

38 citations


Journal ArticleDOI
TL;DR: A Bayesian approach to mixture modelling, implemented through the Gibbs sampler, together with a method based on the predictive distribution for determining the number of components in the mixture.
Abstract: This paper describes a Bayesian approach to mixture modelling and a method based on predictive distribution to determine the number of components in the mixtures. The implementation is done through the use of the Gibbs sampler. The method is described through the mixtures of normal and gamma distributions. Analysis is presented in one simulated and one real data example. The Bayesian results are then compared with the likelihood approach for the two examples.
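To give a flavour of the implementation, a stripped-down Gibbs sampler for a K-component normal mixture is sketched below, assuming a known common variance, a vague normal prior on the means and a symmetric Dirichlet prior on the weights; this is a simplification of the normal and gamma mixture models actually treated in the paper.

```python
import numpy as np

def gibbs_normal_mixture(y, K=2, n_iter=2000, sigma=1.0, rng=None):
    """Gibbs sampler for a K-component normal mixture with known variance."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y, dtype=float)
    m0, s0sq = 0.0, 100.0                       # prior: mu_k ~ N(m0, s0sq)
    mu = rng.choice(y, K)                       # initialise means at data points
    w = np.full(K, 1.0 / K)
    draws = []
    for _ in range(n_iter):
        # 1. sample allocations z_i given means and weights
        logp = np.log(w) - 0.5 * ((y[:, None] - mu) / sigma) ** 2
        p = np.exp(logp - logp.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        z = np.array([rng.choice(K, p=pi) for pi in p])
        # 2. sample component means given allocations
        for k in range(K):
            yk = y[z == k]
            prec = len(yk) / sigma**2 + 1.0 / s0sq
            mean = (yk.sum() / sigma**2 + m0 / s0sq) / prec
            mu[k] = rng.normal(mean, 1.0 / np.sqrt(prec))
        # 3. sample weights given allocations
        w = rng.dirichlet(1.0 + np.bincount(z, minlength=K))
        draws.append((mu.copy(), w.copy()))
    return draws
```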

38 citations


Journal ArticleDOI
TL;DR: This paper presents an efficient approach to leave-one-out cross-validation of principal components that exploits the regular nature of leave-one-out principal component eigenvalue downdating, derives influence statistics, and considers the application to principal component regression.
Abstract: The cross-validation of principal components is a problem that occurs in many applications of statistics. The naive approach of omitting each observation in turn and repeating the principal component calculations is computationally costly. In this paper we present an efficient approach to leave-one-out cross-validation of principal components. This approach exploits the regular nature of leave-one-out principal component eigenvalue downdating. We derive influence statistics and consider the application to principal component regression.

27 citations


Journal ArticleDOI
TL;DR: This paper deals with techniques for obtaining random point samples from spatial databases by discussing two fundamental approaches to sampling with spatial predicates, depending on whether the authors sample first or evaluate the predicate first.
Abstract: This paper deals with techniques for obtaining random point samples from spatial databases. We seek random points from a continuous domain (usually ℝ2) which satisfy a spatial predicate that is represented in the database as a collection of polygons. Several applications of spatial sampling (e.g. environmental monitoring, agronomy, forestry, etc) are described. Sampling problems are characterized in terms of two key parameters: coverage (selectivity), and expected stabbing number (overlap). We discuss two fundamental approaches to sampling with spatial predicates, depending on whether we sample first or evaluate the predicate first. The approaches are described in the context of both quadtrees and R-trees, detailing the sample first, acceptance/rejection tree, and partial area tree algorithms. A sequential algorithm, the one-pass spatial reservoir algorithm is also described. The relative performance of the various sampling algorithms is compared and choice of preferred algorithms is suggested. We conclude with a short discussion of possible extensions.
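The 'sample first' approach can be illustrated without any spatial index: draw uniform points in the bounding box of the polygons, then evaluate the predicate by a point-in-polygon test and keep only accepted points. The sketch below does just that (plain ray-casting, no quadtree or R-tree), so it shows the acceptance/rejection idea rather than the tree-based algorithms of the paper; its efficiency degrades as coverage decreases, which is exactly the trade-off the paper analyses.

```python
import numpy as np

def point_in_polygon(px, py, poly):
    """Ray-casting point-in-polygon test; poly is a sequence of (x, y) vertices."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > py) != (y2 > py):
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside

def sample_first(polygons, n, rng=None):
    """Uniform points in the bounding box, accepted if inside some polygon."""
    rng = np.random.default_rng() if rng is None else rng
    pts = np.concatenate([np.asarray(p, dtype=float) for p in polygons])
    (xmin, ymin), (xmax, ymax) = pts.min(axis=0), pts.max(axis=0)
    accepted = []
    while len(accepted) < n:
        x, y = rng.uniform(xmin, xmax), rng.uniform(ymin, ymax)
        if any(point_in_polygon(x, y, poly) for poly in polygons):
            accepted.append((x, y))
    return np.array(accepted)
```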

27 citations


Journal ArticleDOI
TL;DR: In this paper, the K principal points of a p-variate random variable X are defined as those points ξ1, ..., ξK which minimize the expected squared distance of X from the nearest of the ξk.
Abstract: The K principal points of a p-variate random variable X are defined as those points ξ1, ..., ξK which minimize the expected squared distance of X from the nearest of the ξk. This paper reviews some of the theory of principal points and presents a method of determining principal points of univariate continuous distributions. The method is applied to the uniform distribution, to the normal distribution and to the exponential distribution.
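In symbols, the definition reads:

```latex
(\xi_1, \dots, \xi_K) \;=\; \operatorname*{arg\,min}_{\xi_1, \dots, \xi_K \in \mathbb{R}^p}
\; \mathbb{E}\!\left[ \min_{1 \le k \le K} \lVert X - \xi_k \rVert^2 \right]
```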

27 citations


Journal ArticleDOI
TL;DR: The simplest construction of bootstrap likelihoods involves two levels of bootstrapping, kernel density estimation and non-parametric curve-smoothing; more accurate and efficient constructions are described, based on smoothing at the first level of nested bootstraps and saddlepoint approximation to remove second-level bootstrap variation.
Abstract: The simplest construction of bootstrap likelihoods involves two levels of bootstrapping, kernel density estimation, and non-parametric curve-smoothing. We describe more accurate and efficient constructions, based on smoothing at the first level of nested bootstraps and saddlepoint approximation to remove second-level bootstrap variation. Detailed illustrations are given.

22 citations


Journal ArticleDOI
Rose Baker1
TL;DR: The F-ratio test for equality of dispersion in two samples is by no means robust, while non-parametric tests either assume a common median, or are not very powerful.
Abstract: The F-ratio test for equality of dispersion in two samples is by no means robust, while non-parametric tests either assume a common median or are not very powerful. Two new permutation tests are presented which do not suffer from either of these problems. Algorithms for Monte Carlo calculation of P values and confidence intervals are given, and the performance of the tests is studied and compared using Monte Carlo simulations for a range of distributional types. The methods used to speed up Monte Carlo calculations, e.g. stratification, are of wider applicability.
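For readers unfamiliar with the mechanics, a generic Monte Carlo permutation P-value skeleton is sketched below with a simple dispersion statistic; it is not the paper's tests (which are constructed specifically to avoid the common-median assumption and use devices such as stratification for speed), only an illustration of how such P values are estimated by simulation.

```python
import numpy as np

def mc_permutation_pvalue(x, y, statistic, n_perm=9999, rng=None):
    """Monte Carlo permutation P-value for a two-sample statistic
    (large values taken as evidence against the null)."""
    rng = np.random.default_rng() if rng is None else rng
    pooled = np.concatenate([x, y])
    n_x = len(x)
    observed = statistic(x, y)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        if statistic(perm[:n_x], perm[n_x:]) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)      # add-one correction

def mad_ratio(x, y):
    """Example dispersion statistic: ratio of mean absolute deviations
    from each sample's own median."""
    return np.mean(np.abs(x - np.median(x))) / np.mean(np.abs(y - np.median(y)))
```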

15 citations


Journal ArticleDOI
TL;DR: In this paper, an image reconstruction problem and the computational difficulties arising in determining the maximum a posteriori (MAP) estimate are described, and two algorithms, iterated conditional modes (ICM) and simulated annealing, are usually applied pixel by pixel.
Abstract: We describe an image reconstruction problem and the computational difficulties arising in determining the maximum a posteriori (MAP) estimate. Two algorithms for tackling the problem, iterated conditional modes (ICM) and simulated annealing, are usually applied pixel by pixel. The performance of this strategy can be poor, particularly for heavily degraded images, and as a potential improvement Jubb and Jennison (1991) suggest the cascade algorithm in which ICM is initially applied to coarser images formed by blocking squares of pixels. In this paper we attempt to resolve certain criticisms of cascade and present a version of the algorithm extended in definition and implementation. As an illustration we apply our new method to a synthetic aperture radar (SAR) image. We also carry out a study of simulated annealing, with and without cascade, applied to a more tractable minimization problem from which we gain insight into the properties of cascade algorithms.
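A bare-bones pixel-by-pixel ICM sweep for a binary image observed with Gaussian noise and an Ising-type smoothing prior is sketched below; it illustrates the baseline strategy that cascade is designed to improve, not the cascade algorithm itself, and beta and sigma are user-chosen parameters.

```python
import numpy as np

def icm_binary(y, beta=1.0, sigma=1.0, n_sweeps=10):
    """Iterated conditional modes for a binary image with Gaussian noise."""
    x = (y > 0.5).astype(int)                 # start from a simple threshold
    rows, cols = y.shape
    for _ in range(n_sweeps):
        for i in range(rows):
            for j in range(cols):
                nbrs = []
                if i > 0:        nbrs.append(x[i - 1, j])
                if i < rows - 1: nbrs.append(x[i + 1, j])
                if j > 0:        nbrs.append(x[i, j - 1])
                if j < cols - 1: nbrs.append(x[i, j + 1])
                best, best_energy = x[i, j], np.inf
                for v in (0, 1):              # local energy: data term + prior term
                    energy = (y[i, j] - v) ** 2 / (2 * sigma ** 2) \
                             + beta * sum(v != nb for nb in nbrs)
                    if energy < best_energy:
                        best, best_energy = v, energy
                x[i, j] = best
    return x
```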

13 citations


Journal ArticleDOI
TL;DR: An extension of the existing models, which removes this constraint, is proposed and the resulting model is semi-parametric and requires computationally intensive techniques for likelihood evaluation.
Abstract: Threshold methods for multivariate extreme values are based on the use of asymptotically justified approximations of both the marginal distributions and the dependence structure in the joint tail. Models derived from these approximations are fitted to a region of the observed joint tail which is determined by suitably chosen high thresholds. A drawback of the existing methods is the necessity for the same thresholds to be taken for the convergence of both marginal and dependence aspects, which can result in inefficient estimation. In this paper an extension of the existing models, which removes this constraint, is proposed. The resulting model is semi-parametric and requires computationally intensive techniques for likelihood evaluation. The methods are illustrated using a coastal engineering application.

Journal ArticleDOI
TL;DR: The computer language [B/D] (an acronym for beliefs adjusted by data), which implements Bayes linear methods, incorporates a natural graphical representation of the analysis, providing a powerful way of thinking about the process of knowledge formulation and criticism which is also accessible to non-technical users.
Abstract: We demonstrate how Bayes linear methods, based on partial prior specifications, bring us quickly to the heart of otherwise complex problems, giving us natural and systematic tools for evaluating our analyses which are not readily available in the usual Bayes formalism. We illustrate the approach using an example concerning problems of prediction in a large brewery. We describe the computer language [B/D] (an acronym for beliefs adjusted by data), which implements the approach. [B/D] incorporates a natural graphical representation of the analysis, providing a powerful way of thinking about the process of knowledge formulation and criticism which is also accessible to non-technical users.

Journal ArticleDOI
TL;DR: In this paper, a Monte Carlo exact conditional test of quasi-independence in two-way incomplete contingency tables is proposed, which depends only on the counts in the cells of interest and not on the remaining cells.
Abstract: A Monte Carlo exact conditional test of quasi-independence in two-way incomplete contingency tables is proposed. The null distribution of a random table under quasi-independence is derived. This distribution depends only on the counts in the cells of interest and not on the counts in the remaining cells. This result is used to improve the efficiency of a proposed simulate-and-reject Monte Carlo procedure for estimating the attained significance level.



Journal ArticleDOI
TL;DR: In this paper, the authors describe a method due to Lindsey (1974a) for fitting different exponential family distributions for a single population to the same data, using Poisson log-linear modelling of the density or mass function.
Abstract: This paper describes a method due to Lindsey (1974a) for fitting different exponential family distributions for a single population to the same data, using Poisson log-linear modelling of the density or mass function. The method is extended to Efron's (1986) double exponential family, giving exact ML estimation of the two parameters not easily achievable directly. The problem of comparing the fit of the non-nested models is addressed by both Bayes and posterior Bayes factors (Aitkin, 1991). The latter allow direct comparisons of deviances from the fitted distributions.
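Lindsey's device can be illustrated for a normal fit: bin the data and fit a Poisson log-linear model whose linear predictor is quadratic in the bin midpoints, with an offset for the bin width and sample size. The sketch below uses statsmodels for the GLM fit purely for convenience; the binning choice and helper name are ours, and the double exponential family extension is not reproduced.

```python
import numpy as np
import statsmodels.api as sm

def lindsey_normal_fit(x, n_bins=30):
    """Fit a normal density by Poisson log-linear modelling of bin counts
    (Lindsey's method); log f(x) is quadratic in x for the normal."""
    counts, edges = np.histogram(x, bins=n_bins)
    mids = 0.5 * (edges[:-1] + edges[1:])
    width = edges[1] - edges[0]
    design = sm.add_constant(np.column_stack([mids, mids ** 2]))
    offset = np.full(n_bins, np.log(len(x) * width))   # E(count) = n * width * f(mid)
    fit = sm.GLM(counts, design, family=sm.families.Poisson(), offset=offset).fit()
    _, b1, b2 = fit.params
    sigma2 = -1.0 / (2.0 * b2)        # coefficient of x^2 is -1/(2 sigma^2)
    mu = b1 * sigma2                  # coefficient of x is mu / sigma^2
    return mu, np.sqrt(sigma2)
```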

Journal ArticleDOI
TL;DR: A uniform method of testing randomness in binary strings is described based on using the binary derivative and it is shown that the new tests are faster and more powerful than several of the well-established tests for randomness.
Abstract: The binary derivative has been used to measure the randomness of a binary string formed by a pseudorandom number generator for use in cipher systems. In this paper we develop statistical properties of the binary derivative and show that certain types of randomness testing in binary derivatives are equivalent to well-established tests for randomness in the original string. A uniform method of testing randomness in binary strings is described based on using the binary derivative. We show that the new tests are faster and more powerful than several of the well-established tests for randomness.
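The binary derivative itself is simply the XOR of adjacent bits, producing a string one bit shorter; a minimal version is sketched below together with a basic frequency (monobit) statistic that could be applied to successive derivatives. This illustrates the object of study only, not the paper's specific tests.

```python
import numpy as np

def binary_derivative(bits):
    """XOR of each adjacent pair of bits: the binary derivative."""
    bits = np.asarray(bits, dtype=int)
    return bits[1:] ^ bits[:-1]

def monobit_z(bits):
    """Frequency (monobit) z-statistic: roughly N(0, 1) for a random string."""
    bits = np.asarray(bits, dtype=int)
    n = len(bits)
    return (2 * bits.sum() - n) / np.sqrt(n)

# e.g. test randomness of the string, its derivative, its second derivative, ...
```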

Journal ArticleDOI
TL;DR: An alternative method, based on substitution sampling, for random variate generation from D-distributions is given, together with an algorithm for random variate generation from SD-distributions.
Abstract: Laud et al. (1993) describe a method for random variate generation from D-distributions. In this paper an alternative method using substitution sampling is given. An algorithm for the random variate generation from SD-distributions is also given.

Journal ArticleDOI
TL;DR: A self-validating numerical method based on interval analysis for the computation of central and non-central F probabilities and percentiles is reported and can be adapted to approximate the probabilities and percentiles for other commonly used distribution functions.
Abstract: A self-validating numerical method based on interval analysis for the computation of central and non-central F probabilities and percentiles is reported. The major advantage of this approach is that there are guaranteed error bounds associated with the computed values (or intervals), i.e. the computed values satisfy the user-specified accuracy requirements. The methodology reported in this paper can be adapted to approximate the probabilities and percentiles for other commonly used distribution functions.

Journal ArticleDOI
TL;DR: A review of a number of data models for aggregate statistical data which have appeared in the computer science literature in the last ten years is given.
Abstract: The paper gives a review of a number of data models for aggregate statistical data which have appeared in the computer science literature in the last ten years. After a brief introduction to the data model in general, the fundamental concepts of statistical data are introduced. These are called statistical objects because they are complex data structures (vectors, matrices, relations, time series, etc) which may have different possible representations (e.g. tables, relations, vectors, pie-charts, bar-charts, graphs, and so on). For this reason a statistical object is defined by two different types of attribute (a summary attribute, with its own summary type and with its own instances, called summary data, and the set of category attributes, which describe the summary attribute). Some conceptual models of statistical data (CSM, SDM4S), some semantic models of statistical data (SCM, SAM*, OSAM*), and some graphical models of statistical data (SUBJECT, GRASS, STORM) are also discussed.

Journal ArticleDOI
TL;DR: In this paper, a simple, automatic and adaptive bivariate density estimator is proposed based on the estimation of marginal and conditional densities, and guidance to practical application of the method is given.
Abstract: The standard approach to non-parametric bivariate density estimation is to use a kernel density estimator. Practical performance of this estimator is hindered by the fact that the estimator is not adaptive (in the sense that the level of smoothing is not sensitive to local properties of the density). In this paper a simple, automatic and adaptive bivariate density estimator is proposed based on the estimation of marginal and conditional densities. Asymptotic properties of the estimator are examined, and guidance to practical application of the method is given. Application to two examples illustrates the usefulness of the estimator as an exploratory tool, particularly in situations where the local behaviour of the density varies widely. The proposed estimator is also appropriate for use as a ‘pilot’ estimate for an adaptive kernel estimate, since it is relatively inexpensive to calculate.
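The general idea of building the bivariate estimate from a marginal and a conditional density can be sketched with Gaussian kernels, as below; this is only a rough illustration of the decomposition f(x, y) = f(x) f(y | x) with fixed, user-chosen bandwidths, not the authors' adaptive estimator.

```python
import numpy as np

def marginal_conditional_density(x, y, hx, hy, grid_x, grid_y):
    """Bivariate density estimate as marginal(x) * conditional(y | x),
    both parts built from Gaussian kernels."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    phi = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    dens = np.empty((len(grid_x), len(grid_y)))
    for i, gx in enumerate(grid_x):
        wx = phi((gx - x) / hx)               # kernel weights in the x direction
        f_x = wx.mean() / hx                  # marginal estimate at gx
        w = wx / wx.sum()                     # weights for the conditional part
        for j, gy in enumerate(grid_y):
            dens[i, j] = f_x * np.sum(w * phi((gy - y) / hy)) / hy
    return dens
```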

Journal ArticleDOI
TL;DR: In this article, a methodology for reducing the bias inherent in density estimation can be developed for models with an additive error component, and an illustrative example involving a stochastic model of molecular fragmentation and measurement is given.
Abstract: Some statistical models defined in terms of a generating stochastic mechanism have intractable distribution theory, which renders parameter estimation difficult. However, a Monte Carlo estimate of the log-likelihood surface for such a model can be obtained via computation of nonparametric density estimates from simulated realizations of the model. Unfortunately, the bias inherent in density estimation can cause bias in the resulting log-likelihood estimate that alters the location of its maximizer. In this paper a methodology for radically reducing this bias is developed for models with an additive error component. An illustrative example involving a stochastic model of molecular fragmentation and measurement is given.
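The naive simulated log-likelihood that the paper starts from can be written in a few lines: simulate realizations from the model at a given parameter value, form a kernel density estimate, and sum its log over the observed data. The bias-reduction machinery for additive-error models is not reproduced here, and `simulate` is a hypothetical user-supplied model simulator.

```python
import numpy as np
from scipy.stats import gaussian_kde

def simulated_loglik(theta, data, simulate, n_sim=5000, rng=None):
    """Naive Monte Carlo log-likelihood via a kernel density estimate of
    simulated realizations; biased because the KDE is biased."""
    rng = np.random.default_rng() if rng is None else rng
    sims = simulate(theta, n_sim, rng)        # draw n_sim realizations at theta
    kde = gaussian_kde(sims)
    return np.sum(np.log(kde(np.asarray(data))))
```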

Journal ArticleDOI
TL;DR: Diagnostic biplots on their own do not seem to provide a sufficient means for model selection for genotype by environment tables, but in combination with other methods they certainly can provide extra insight into the structure of the data.
Abstract: Popular rank-2 and rank-3 models for two-way tables have geometrical properties which can be used as diagnostic keys in screening for an appropriate model. Row and column levels of two-way tables are represented by points in two or three dimensional space, whereupon collinearity and coplanarity of row and column points provide diagnostic keys for informal model choice. Coordinates are obtained from a factorization of the two-way table Y in the matrix product UV T. The rows of U then contain row-point coordinates and the rows of V column-point coordinates. Illustrations of applications of diagnostic biplots in the literature were restricted to data from chemistry and physics with little or no noise. In plant breeding, two-way tables containing substantial amounts of noise regularly arise in the form of genotype by environment tables. To investigate the usefulness of diagnostic biplots for model screening for genotype by environment tables, data tables were generated from a range of two-way models under the addition of various amounts of noise. Chances for correct diagnosis of the generating model depended on the type of model. Diagnostic biplots on their own do not seem to provide a sufficient means for model selection for genotype by environment tables, but in combination with other methods they certainly can provide extra insight into the structure of the data.
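The biplot coordinates can be obtained from a singular value decomposition; a minimal sketch for the rank-2 case is given below, splitting the singular values symmetrically between row and column points (one common convention among several). Diagnosis then rests on visual inspection of (near-)collinearity of the plotted points.

```python
import numpy as np

def rank2_biplot_coords(Y):
    """Rank-2 factorisation Y ~ U V^T via the SVD; returns row- and
    column-point coordinates for a diagnostic biplot."""
    U, s, Vt = np.linalg.svd(np.asarray(Y, dtype=float), full_matrices=False)
    row_coords = U[:, :2] * np.sqrt(s[:2])    # one point per row of Y
    col_coords = Vt[:2].T * np.sqrt(s[:2])    # one point per column of Y
    return row_coords, col_coords
```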

Journal ArticleDOI
TL;DR: The new heuristic and relaxed heuristic algorithms proposed in this paper are shown to find computationally more efficient elimination orderings than previously proposed heuristic algorithms.
Abstract: The computational cost, in both storage requirements and calculation, of performing an elimination ordering is considered as a function of the order in which the vertices of a graph are eliminated. Several different heuristic and relaxed heuristic algorithms for finding low cost elimination orderings are described and compared. The new heuristic and relaxed heuristic algorithms proposed in this paper are shown to find computationally more efficient elimination orderings than previously proposed heuristic algorithms.
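As a point of reference, the classic greedy minimum-degree heuristic (a standard baseline in this literature, not necessarily one of the new algorithms proposed in the paper) can be sketched as follows; eliminating a vertex adds fill edges among its remaining neighbours, which is what drives the cost being minimized.

```python
def min_degree_ordering(adj):
    """Greedy minimum-degree elimination ordering.
    adj maps each vertex to the set of its neighbours."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}    # work on a copy
    order = []
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))        # vertex of minimum degree
        nbrs = adj.pop(v)
        for u in nbrs:                                 # connect remaining neighbours
            adj[u].discard(v)
            adj[u] |= (nbrs - {u})
        order.append(v)
    return order

# e.g. min_degree_ordering({1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}})
```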

Journal ArticleDOI
TL;DR: A general idea based on variate transformations, which can be tailored to standard black-box sampling methods and increases the efficiency of the Gibbs sampler, is presented, together with a simple technique to assess convergence and illustrative examples.
Abstract: In the non-conjugate Gibbs sampler, the required sampling from the full conditional densities needs the adoption of black-box sampling methods. Recent suggestions include rejection sampling, adaptive rejection sampling, generalized ratio of uniforms, and the Griddy-Gibbs sampler. This paper describes a general idea based on variate transformations which can be tailored in all the above methods and increase the Gibbs sampler efficiency. Moreover, a simple technique to assess convergence is suggested and illustrative examples are presented.

Journal ArticleDOI
TL;DR: An algorithm is presented for simulating disease data in pedigrees, incorporating variable age at onset and genetic and environmental effects, and is computationally efficient, making multi-dataset simulation studies feasible.
Abstract: The field of genetic epidemiology is growing rapidly with the realization that many important diseases are influenced by both genetic and environmental factors. For this reason, pedigree data are becoming increasingly valuable as a means of studying patterns of disease occurrence. Analysis of pedigree data is complicated by the lack of independence among family members and by the non-random sampling schemes used to ascertain families. An additional complicating factor is the variability in age at disease onset from one person to another. In developing statistical methods for analysing pedigree data, analytic results are often intractable, making simulation studies imperative for assessing the performance of proposed methods and estimators. In this paper, an algorithm is presented for simulating disease data in pedigrees, incorporating variable age at onset and genetic and environmental effects. Computational formulas are developed in the context of a proportional hazards model and assuming single ascertainment of families, but the methods can be easily generalized to alternative models. The algorithm is computationally efficient, making multi-dataset simulation studies feasible. Numerical examples are provided to demonstrate the methods.

Journal ArticleDOI
TL;DR: The types of metadata encountered and the problems associated with dealing with them are discussed, and an alternative approach based on textual markup rather than, for example, the relational model is described.
Abstract: With many types of scientific data, the amount of descriptive and qualifying information associated with the data values is quite variable and potentially large compared with the number of actual data values. This problem has been found to be particularly acute when dealing with data about the nutrient composition of foods, and a system—based on textual markup rather than, for example, the relational model—has been developed to deal with it. This paper discusses the types of metadata encountered and the problems associated with dealing with them, and then describes this alternative approach. The approach described has been installed in several locations around the world, and is in preliminary use as a tool for interchanging data among different databases as well as local database management.

Journal ArticleDOI
TL;DR: The discussion has shown, I fear, that the formulation of linear models, one of the basic tools of the authors' trade, is in an unsatisfactory state; also that the argument is not just one between Nelder and the rest of the world.
Abstract: The discussion has shown, I fear, that the formulation of linear models, one of the basic tools of our trade, is in an unsatisfactory state; also that the argument is not just one between Nelder and the rest of the world. Why the sorry state? I believe that the problem is that statistical theory is driven too often by mathematics rather than by the requirements of scientific inference. None of the hypotheses I have described as uninteresting are mathematically wrong; rather I contend that they do not correspond to any inferential quantities of interest. Searle points out that in the first example of Section 5 the original authors did not specify a nested structure of the type I discussed. However, without such a prior nested structure the hypotheses being discussed are, to me, pointless. They are mathematically specifiable but do not correspond to anything of inferential interest. Even with the prior structure two of the three remain uninteresting. The contributions of Aitkin and van Eeuwijk show what a mess the applied literature is in, as the result of some of the confusions I discussed. This confusion is not something that statisticians can or should ignore.


Journal ArticleDOI
TL;DR: This paper considers data compression through the use of wavelet analysis followed by a thresholding of small coefficients in the resulting multiresolution decomposition, together with an empirical modification that shows promise in better preserving certain structures in the particular sound data considered.
Abstract: In this paper we consider data on underwater sounds of differing types. Our objective is to filter background noise and achieve an acceptable level of reduction in the raw data, whilst at the same time maintaining the main features of the original signal. In particular, we consider data compression through the use of wavelet analysis followed by a thresholding of small coefficients in the resulting multiresolution decomposition. Various methods to threshold the wavelet representation are discussed and compared using recordings of dolphin sounds. An empirical modification to one of them is also proposed which shows promise in better preserving certain structures in our particular sound data.
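A generic wavelet-thresholding compressor, using the universal threshold with a robust noise estimate from the finest-scale coefficients, is sketched below; it assumes the PyWavelets package is available and is not the empirical modification proposed in the paper.

```python
import numpy as np
import pywt   # PyWavelets, assumed available

def compress_signal(signal, wavelet="db4", level=5):
    """Soft-threshold small detail coefficients in a multiresolution
    decomposition and reconstruct the signal."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745     # robust noise scale estimate
    thresh = sigma * np.sqrt(2 * np.log(len(signal)))  # universal threshold
    new_coeffs = [coeffs[0]] + [pywt.threshold(c, thresh, mode="soft")
                                for c in coeffs[1:]]
    return pywt.waverec(new_coeffs, wavelet)[:len(signal)]
```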