
Showing papers on "Sampling (statistics)" published in 2003


Journal ArticleDOI
TL;DR: The following techniques for uncertainty and sensitivity analysis are briefly summarized: Monte Carlo analysis, differential analysis, response surface methodology, Fourier amplitude sensitivity test, Sobol' variance decomposition, and fast probability integration.
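
As a hedged illustration of the first technique in this list, the sketch below propagates input uncertainty through a made-up model by simple Monte Carlo and uses Spearman rank correlations as a crude sensitivity measure; the model, input distributions, and sample size are hypothetical and not taken from the paper.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical model y = f(x1, x2, x3); a stand-in for any simulation code.
def model(x):
    return x[:, 0] ** 2 + 2.0 * x[:, 1] + 0.1 * x[:, 2]

# Propagate uncertainty in the inputs (assumed independent and uniform here).
n = 10_000
x = rng.uniform(low=[0.0, 0.0, 0.0], high=[1.0, 2.0, 5.0], size=(n, 3))
y = model(x)

# Uncertainty analysis: the distribution of the output.
print("mean", y.mean(), "std", y.std(), "95% interval", np.percentile(y, [2.5, 97.5]))

# Crude sensitivity analysis: Spearman rank correlation of each input with the output.
for j in range(3):
    rho, _ = spearmanr(x[:, j], y)
    print(f"x{j + 1}: rank correlation {rho:+.2f}")
```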

1,780 citations


Journal ArticleDOI
TL;DR: The paper presents the necessity and usefulness of multivariate statistical assessment of large and complex databases for obtaining better information about surface water quality, for the design of sampling and analytical protocols, and for effective pollution control and management of surface waters.

1,136 citations


Journal ArticleDOI
TL;DR: In this paper, a range of monitoring techniques are used to measure pollutant concentrations in urban street canyons, such as continuous monitoring, passive and active pre-concentration sampling, and grab sampling.

1,003 citations


Book ChapterDOI
01 Jan 2003
TL;DR: In this article, Monte Carlo sampling methods for solving large scale stochastic programming problems are discussed, where a random sample is generated outside of an optimization procedure, and then the constructed, so-called sample average approximation (SAA), problem is solved by an appropriate deterministic algorithm.
Abstract: In this chapter we discuss Monte Carlo sampling methods for solving large scale stochastic programming problems. We concentrate on the “exterior” approach where a random sample is generated outside of an optimization procedure, and then the constructed, so-called sample average approximation (SAA), problem is solved by an appropriate deterministic algorithm. We study statistical properties of the obtained SAA estimators. The developed statistical inference is incorporated into validation analysis and error estimation. We describe some variance reduction techniques which may enhance convergence of sampling based estimates. We also discuss difficulties in extending this methodology to multistage stochastic programming. Finally, we briefly discuss the SAA method applied to stochastic generalized equations and variational inequalities.
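
A minimal sketch of the exterior/SAA idea on a toy newsvendor problem, assuming a lognormal demand and cost parameters invented for illustration; the chapter's setting covers far more general stochastic programs.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)

# Toy newsvendor: choose order quantity q to minimize the expected cost
# c*q - r*E[min(q, D)] with random demand D.
c, r = 1.0, 2.0
demand_sample = rng.lognormal(mean=3.0, sigma=0.5, size=5_000)   # "exterior" sample

def saa_objective(q):
    # Sample average approximation of the expected cost at order quantity q.
    return c * q - r * np.mean(np.minimum(q, demand_sample))

res = minimize_scalar(saa_objective, bounds=(0.0, 200.0), method="bounded")
print("SAA solution q* ~", round(res.x, 2), " SAA optimal value ~", round(res.fun, 2))
```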

990 citations


Journal ArticleDOI
01 Jan 2003
TL;DR: In this paper, a Markov chain is constructed by alternating uniform sampling in the vertical direction with uniform sampling from the horizontal "slice" defined by the current vertical position, or more generally, with some update that leaves the uniform distribution over this slice invariant.
Abstract: Markov chain sampling methods that adapt to characteristics of the distribution being sampled can be constructed using the principle that one can sample from a distribution by sampling uniformly from the region under the plot of its density function. A Markov chain that converges to this uniform distribution can be constructed by alternating uniform sampling in the vertical direction with uniform sampling from the horizontal "slice" defined by the current vertical position, or more generally, with some update that leaves the uniform distribution over this slice invariant. Such "slice sampling" methods are easily implemented for univariate distributions, and can be used to sample from a multivariate distribution by updating each variable in turn. This approach is often easier to implement than Gibbs sampling and more efficient than simple Metropolis updates, due to the ability of slice sampling to adaptively choose the magnitude of changes made. It is therefore attractive for routine and automated use. Slice sampling methods that update all variables simultaneously are also possible. These methods can adaptively choose the magnitudes of changes made to each variable, based on the local properties of the density function. More ambitiously, such methods could potentially adapt to the dependencies between variables by constructing local quadratic approximations. Another approach is to improve sampling efficiency by suppressing random walks. This can be done for univariate slice sampling by "overrelaxation," and for multivariate slice sampling by "reflection" from the edges of the slice.
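
A minimal univariate slice sampler with stepping-out and shrinkage, roughly in the spirit of the procedure described above; the step width, target density, and starting point are arbitrary choices for illustration.

```python
import numpy as np

def slice_sample(logf, x0, n_samples, w=1.0, rng=None):
    """Univariate slice sampling with stepping-out and shrinkage (sketch)."""
    rng = np.random.default_rng(rng)
    samples, x = [], x0
    for _ in range(n_samples):
        # Draw the auxiliary "height" that defines the horizontal slice.
        logy = logf(x) + np.log(rng.uniform())
        # Step out to find an interval containing the slice.
        left = x - w * rng.uniform()
        right = left + w
        while logf(left) > logy:
            left -= w
        while logf(right) > logy:
            right += w
        # Sample uniformly from the interval, shrinking it on rejection.
        while True:
            x_new = rng.uniform(left, right)
            if logf(x_new) > logy:
                x = x_new
                break
            if x_new < x:
                left = x_new
            else:
                right = x_new
        samples.append(x)
    return np.array(samples)

# Example: sample a standard normal via its unnormalized log-density.
draws = slice_sample(lambda x: -0.5 * x * x, x0=0.0, n_samples=5000)
print(draws.mean(), draws.std())
```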

968 citations


Book
24 Feb 2003
TL;DR: This book discusses the evolution of Survey Process Quality and its implications for Questionnaire Design, as well as practical Survey Design for Minimizing Total Survey Error.
Abstract: Preface. Chapter 1. The Evolution of Survey Process Quality. 1.1 The Concept of a Survey. 1.2 Types of Surveys. 1.3 Brief History of Survey Methodology. 1.4 The Quality Revolution. 1.5 Definitions of Quality and Quality in Statistical Organizations. 1.6 Measuring Quality. 1.7 Improving Quality. 1.8 Quality in a Nutshell. Chapter 2. The Survey Process and Data Quality. 2.1 Overview of the Survey Process. 2.2 Data Quality and Total Survey Error. 2.3 Decomposing Nonsampling Error into Its Component Parts. 2.4 Gauging the Magnitude of Total Survey Error. 2.5 Mean Squared Error. 2.6 An Illustration of the Concepts. Chapter 3. Coverage and Nonresponse Error. 3.1 Coverage Error. 3.2 Measures of Coverage Bias. 3.3 Reducing Coverage Bias. 3.4 Unit Nonresponse Error. 3.5 Calculating Response Rates. 3.6 Reducing Nonresponse Bias. Chapter 4. The Measurement Process and Its Implications for Questionnaire Design. 4.1 Components of Measurement Error. 4.2 Errors Arising from the Questionnaire Design. 4.3 Understanding the Response Process. Chapter 5. Errors Due to Interviewers and Interviewing. 5.1 Role of the Interviewer. 5.2 Interviewer Variability. 5.3 Design Factors that Influence Interviewer Effects. 5.4 Evaluation of Interviewer Performance. Chapter 6. Data Collection Modes and Associated Errors. 6.1 Modes of Data Collection. 6.2 Decision Regarding Mode. 6.3 Some Examples of Mode Effects. Chapter 7. Data Processing: Errors and Their Control. 7.1 Overview of Data Processing Steps. 7.2 Nature of Data Processing Error. 7.3 Data Capture Errors. 7.4 Post-Data Capture Editing. 7.5 Coding. 7.6 File Preparation. 7.7 Applications of Continuous Quality Improvement: The Case of Coding. 7.8 Integration Activities. Chapter 8. Overview of Survey Error Evaluation Methods. 8.1 Purposes of Survey Error Evaluation. 8.2 Evaluation Methods for Designing and Pretesting Surveys. 8.3 Methods for Monitoring and Controlling Data Quality. 8.4 Postsurvey Evaluations. 8.5 Summary of Evaluation Methods. Chapter 9. Sampling Error. 9.1 Brief History of Sampling. 9.2 Nonrandom Sampling Methods. 9.3 Simple Random Sampling. 9.4 Statistical Inference in the Presence of Nonsampling Errors. 9.5 Other Methods of Random Sampling. 9.6 Concluding Remarks. Chapter 10. Practical Survey Design for Minimizing Total Survey Error. 10.1 Balance Between Cost, Survey Error, and Other Quality Features. 10.2 Planning a Survey for Optimal Quality. 10.3 Documenting Survey Quality. 10.4 Organizational Issues Related to Survey Quality. References. Index.

795 citations


Journal ArticleDOI
TL;DR: In this article, it was shown that more than 50% of the computer effort can be saved by using Latin hypercubes instead of simple Monte Carlo in importance sampling, however, the exact savings are dependent on details in the use of Latin Hypercubes and on the shape of the failure surfaces of the problems.

586 citations


Journal ArticleDOI
TL;DR: A statistical algorithm to sample rigorously and exactly from the Boltzmann ensemble of secondary structures is presented, showing that a sample of moderate size from the ensemble of an enormous number of possible structures is sufficient to guarantee statistical reproducibility in the estimates of typical sampling statistics.
Abstract: An RNA molecule, particularly a long-chain mRNA, may exist as a population of structures. Furthermore, multiple structures have been demonstrated to play important functional roles. Thus, a representation of the ensemble of probable structures is of interest. We present a statistical algorithm to sample rigorously and exactly from the Boltzmann ensemble of secondary structures. The forward step of the algorithm computes the equilibrium partition functions of RNA secondary structures with recent thermodynamic parameters. Using conditional probabilities computed with the partition functions in a recursive sampling process, the backward step of the algorithm quickly generates a statistically representative sample of structures. With cubic run time for the forward step, quadratic run time in the worst case for the sampling step, and quadratic storage, the algorithm is efficient for broad applicability. We demonstrate that, by classifying sampled structures, the algorithm enables a statistical delineation and representation of the Boltzmann ensemble. Applications of the algorithm show that alternative biological structures are revealed through sampling. Statistical sampling provides a means to estimate the probability of any structural motif, with or without constraints. For example, the algorithm enables probability profiling of single-stranded regions in RNA secondary structure. Probability profiling for specific loop types is also illustrated. By overlaying probability profiles, a mutual accessibility plot can be displayed for predicting RNA:RNA interactions. Boltzmann probability-weighted density of states and free energy distributions of sampled structures can be readily computed. We show that a sample of moderate size from the ensemble of an enormous number of possible structures is sufficient to guarantee statistical reproducibility in the estimates of typical sampling statistics. Our applications suggest that the sampling algorithm may be well suited to prediction of mRNA structure and target accessibility. The algorithm is applicable to the rational design of small interfering RNAs (siRNAs), antisense oligonucleotides, and trans-cleaving ribozymes in gene knock-down studies.
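
The actual algorithm samples via partition-function recursions without ever enumerating structures; the toy sketch below only illustrates the target distribution, drawing from a small, hypothetical set of "structures" with Boltzmann weights e^{-E/RT}/Z.

```python
import numpy as np

# Hypothetical free energies (kcal/mol) for a handful of enumerated structures.
energies = np.array([-12.3, -11.8, -10.5, -9.9, -8.0])
RT = 0.616   # kcal/mol at 37 degrees C

weights = np.exp(-energies / RT)
probs = weights / weights.sum()          # Boltzmann probabilities e^{-E/RT} / Z

rng = np.random.default_rng(3)
sample = rng.choice(len(energies), size=1000, p=probs)
print("sampled frequencies:   ", np.bincount(sample, minlength=len(energies)) / 1000)
print("Boltzmann probabilities:", probs.round(3))
```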

558 citations


Posted Content
TL;DR: This work shows that the optimal sampling frequency at which to estimate the parameters of a discretely sampled continuous-time model can be finite when the observations are contaminated by market microstructure effects, and addresses the question of what to do about the presence of the noise.
Abstract: Classical statistics suggest that for inference purposes one should always use as much data as is available. We study how the presence of market microstructure noise in high-frequency financial data can change that result. We show that the optimal sampling frequency at which to estimate the parameters of a discretely sampled continuous-time model can be finite when the observations are contaminated by market microstructure effects. We then address the question of what to do about the presence of the noise. We show that modelling the noise term explicitly restores the first order statistical effect that sampling as often as possible is optimal. But, more surprisingly, we also demonstrate that this is true even if one misspecifies the assumed distribution of the noise term. Not only is it still optimal to sample as often as possible, but the estimator has the same variance as if the noise distribution had been correctly specified, implying that attempts to incorporate the noise into the analysis cannot do more harm than good. Finally, we study the same questions when the observations are sampled at random time intervals, which are an essential feature of transaction-level data.
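
A hedged simulation of the effect described above: realized variance computed from noise-contaminated prices is heavily biased at the highest sampling frequencies, which is what makes the choice of sampling interval non-trivial. All parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# One trading day of second-by-second efficient log-prices (Brownian motion)
# contaminated with i.i.d. microstructure noise.
n_sec, sigma, noise_sd = 23_400, 0.01, 5e-4
efficient = np.cumsum(rng.normal(0.0, sigma / np.sqrt(n_sec), n_sec))
observed = efficient + rng.normal(0.0, noise_sd, n_sec)

for step in (1, 5, 30, 300, 1800):                # sampling interval in seconds
    r = np.diff(observed[::step])
    rv = np.sum(r ** 2)                           # realized variance at this frequency
    print(f"sample every {step:>4d} s: RV = {rv:.6f}   (true integrated variance = {sigma**2:.6f})")
```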

520 citations


Journal ArticleDOI
TL;DR: In this paper, the authors show that the bias increases with sample size, and is affected by the underlying shape of the species habitat, the magnitude of errors in locations, and the spatial and temporal distribution of sampling effort.
Abstract: Minimum convex polygons (convex hulls) are an internationally accepted, standard method for estimating species’ ranges, particularly in circumstances in which presence-only data are the only kind of spatially explicit data available. One of their main strengths is their simplicity. They are used to make area statements and to assess trends in occupied habitat, and are an important part of the assessment of the conservation status of species. We show by simulation that these estimates are biased. The bias increases with sample size, and is affected by the underlying shape of the species habitat, the magnitude of errors in locations, and the spatial and temporal distribution of sampling effort. The errors affect both area statements and estimates of trends. Some of these errors may be reduced through the application of α-hulls, which are generalizations of convex hulls, but they cannot be eliminated entirely. α-hulls provide an explicit means for excluding discontinuities within a species range. Strengths and weaknesses of alternatives including kernel estimators were examined. Convex hulls exhibit larger bias than α-hulls when used to quantify habitat extent and to detect changes in range, and when subject to differences in the spatial and temporal distribution of sampling effort and spatial accuracy. α-hulls should be preferred for estimating the extent of and trends in species’ ranges.
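
A small sketch of the bias mechanism, assuming a hypothetical C-shaped habitat: the convex hull bridges the gap in the range, and its area keeps growing as more presence records accumulate.

```python
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(5)

# Presence-only records from a C-shaped habitat (a three-quarter annulus).
def habitat_points(n):
    theta = rng.uniform(0.25 * np.pi, 1.75 * np.pi, n)
    r = rng.uniform(0.8, 1.0, n)
    return np.column_stack([r * np.cos(theta), r * np.sin(theta)])

for n in (10, 50, 500, 5000):
    hull = ConvexHull(habitat_points(n))
    print(f"n = {n:>4d}: convex-hull area = {hull.volume:.3f}")   # 2-D 'volume' is area
```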

462 citations


Posted Content
Abstract: The article of record as published may be located at http://dx.doi.org/10.1287/ijoc.1050.0136


Proceedings ArticleDOI
01 Jan 2003
TL;DR: An overview of modern design of experiments (DOE) techniques that can be applied in computational engineering design studies. Several types of modern DOE methods are described, including pseudo-Monte Carlo sampling, quasi-Monte Carlo sampling, Latin hypercube sampling, orthogonal array sampling, and Hammersley sequence sampling.
Abstract: The intent of this paper is to provide an overview of modern design of experiments (DOE) techniques that can be applied in computational engineering design studies. The term modern refers to DOE techniques specifically designed for use with deterministic computer simulations. In addition, this term is used to contrast classical DOE techniques that were developed for laboratory and field experiments that possess random error sources. Several types of modern DOE methods are described including pseudo-Monte Carlo sampling, quasi-Monte Carlo sampling, Latin hypercube sampling, orthogonal array sampling, and Hammersley sequence sampling.
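
A sketch of one of the listed methods, the Hammersley sequence, built from radical inverses in successive prime bases; the other methods follow analogous constructions. The implementation below supports up to 11 dimensions as written.

```python
import numpy as np

def radical_inverse(i, base):
    """Van der Corput radical inverse of integer i in the given base."""
    inv, f = 0.0, 1.0 / base
    while i > 0:
        inv += f * (i % base)
        i //= base
        f /= base
    return inv

def hammersley(n, dim):
    """n points of the Hammersley sequence in [0, 1)^dim: first coordinate i/n,
    remaining coordinates radical inverses in successive prime bases."""
    primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29][: dim - 1]
    pts = np.empty((n, dim))
    for i in range(n):
        pts[i, 0] = i / n
        for d, b in enumerate(primes, start=1):
            pts[i, d] = radical_inverse(i, b)
    return pts

print(hammersley(8, 3))
```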

Journal ArticleDOI
TL;DR: To design and apply statistical tests for measuring sampling bias in the raw data used to determine priority areas for conservation, and to discuss their impact on conservation analyses for the region.
Abstract: Aim: To design and apply statistical tests for measuring sampling bias in the raw data used to determine priority areas for conservation, and to discuss their impact on conservation analyses for the region. Location: Sub-Saharan Africa. Methods: An extensive data set comprising 78,083 vouchered locality records for 1068 passerine birds in sub-Saharan Africa has been assembled. Using geographical information systems, we designed and applied two tests to determine if sampling of these taxa was biased. First, we detected possible biases because of accessibility by measuring the proximity of each record to cities, rivers and roads. Second, we quantified the intensity of sampling of each species inside and surrounding proposed conservation priority areas and compared it with sampling intensity in non-priority areas. We applied statistical tests to determine if the distribution of these sampling records deviated significantly from random distributions. Results: The analyses show that the location and intensity of collecting have historically been heavily influenced by accessibility. Sampling localities show dense, significant aggregation around city limits, and along rivers and roads. When examining the collecting sites of each individual species, the pattern of sampling has been significantly concentrated within and immediately surrounding areas now designated as conservation priorities. Main conclusions: Assessment of patterns of species richness and endemicity at the scale useful for establishing conservation priorities, below the continental level, undoubtedly reflects biases in taxonomic sampling. This is especially problematic for priorities established using the criterion of complementarity because the estimated spatial costs of this approach are highly sensitive to sampling artefacts. Hence such conservation priorities should be interpreted with caution proportional to the bias found. We argue that conservation priority setting analyses require (1) statistical tests to detect these biases, and (2) data treatment to reflect species distribution rather than patterns of collecting effort.
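
A hedged sketch of an accessibility-bias test in the same spirit: compare the mean nearest-distance from records to roads against a Monte Carlo null of uniformly placed localities. Coordinates, road geometry, and sample sizes are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

def nearest_road_distance(points, road_xy):
    # Euclidean distance from each point to its nearest digitised road vertex.
    d = np.linalg.norm(points[:, None, :] - road_xy[None, :, :], axis=2)
    return d.min(axis=1)

# Hypothetical inputs: collection localities and road vertices, as (x, y) in km.
records = rng.normal(loc=[50.0, 50.0], scale=5.0, size=(300, 2))   # records cluster near the road
roads = np.column_stack([np.linspace(0.0, 100.0, 200), np.full(200, 50.0)])

observed = nearest_road_distance(records, roads).mean()

# Null hypothesis: the same number of localities placed uniformly at random.
null = np.array([
    nearest_road_distance(rng.uniform(0.0, 100.0, size=(300, 2)), roads).mean()
    for _ in range(999)
])
p_value = (np.sum(null <= observed) + 1) / (len(null) + 1)
print(f"mean distance to road: {observed:.2f} km, Monte Carlo p = {p_value:.3f}")
```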

Journal ArticleDOI
TL;DR: In this article, two artificial neural networks (ANNs), an unsupervised and a supervised learning algorithm, were applied to suggest practical approaches for the analysis of ecological data; the results suggested that methodologies successively using the two different neural networks are helpful for understanding ecological data through ordination first, and then for predicting target variables.

Proceedings ArticleDOI
10 Nov 2003
TL;DR: A hybrid sampling strategy in the PRM framework for finding paths through narrow passages is presented, which enables relatively small roadmaps to reliably capture the connectivity of configuration spaces with difficult narrow passages.
Abstract: Probabilistic roadmap (PRM) planners have been successful in path planning of robots with many degrees of freedom, but narrow passages in a robot's configuration space create significant difficulty for PRM planners. This paper presents a hybrid sampling strategy in the PRM framework for finding paths through narrow passages. A key ingredient of the new strategy is the bridge test, which boosts the sampling density inside narrow passages. The bridge test relies on simple tests of local geometry and can be implemented efficiently in high-dimensional configuration spaces. The strengths of the bridge test and uniform sampling complement each other naturally and are combined to generate the final hybrid sampling strategy. Our planner was tested on point robots and articulated robots in planar workspaces. Preliminary experiments show that the hybrid sampling strategy enables relatively small roadmaps to reliably capture the connectivity of configuration spaces with difficult narrow passages.
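
A minimal sketch of the bridge test on a toy 2-D configuration space with a narrow slit; the collision checker, bridge length, and workspace are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(7)

def bridge_test_samples(in_collision, dim, n_samples, bridge_sd=0.4, lo=0.0, hi=1.0):
    """Bridge test: keep the midpoint of a short random segment whose two
    endpoints are both in collision but whose midpoint is free. Such midpoints
    tend to lie inside narrow passages. `in_collision(q)` is a user-supplied
    collision checker (assumed here)."""
    samples = []
    while len(samples) < n_samples:
        a = rng.uniform(lo, hi, dim)
        if not in_collision(a):
            continue
        b = a + rng.normal(0.0, bridge_sd, dim)          # nearby second endpoint
        if not in_collision(b):
            continue
        mid = 0.5 * (a + b)
        if not in_collision(mid):
            samples.append(mid)                          # midpoint of a "bridge"
    return np.array(samples)

# Toy 2-D configuration space: free space is a narrow horizontal slit of width 0.04.
slit_world = lambda q: not (0.48 < q[1] < 0.52)          # True means "in collision"
pts = bridge_test_samples(slit_world, dim=2, n_samples=200)
print(pts[:, 1].min(), pts[:, 1].max())                  # all midpoints lie inside the slit
```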

Proceedings ArticleDOI
09 Jul 2003
TL;DR: It is shown that when graphs are sampled using traceroute-like methods, the resulting degree distribution can differ sharply from that of the underlying graph, and the reasons why this effect arises are explored.
Abstract: Considerable attention has been focused on the properties of graphs derived from Internet measurements. Router-level topologies collected via traceroute-like methods have led some to conclude that the router graph of the Internet is well modeled as a power-law random graph. In such a graph, the degree distribution of nodes follows a distribution with a power-law tail. We argue that the evidence to date for this conclusion is at best insufficient. We show that when graphs are sampled using traceroute-like methods, the resulting degree distribution can differ sharply from that of the underlying graph. For example, given a sparse Erdos-Renyi random graph, the subgraph formed by a collection of shortest paths from a small set of random sources to a larger set of random destinations can exhibit a degree distribution remarkably like a power-law. We explore the reasons why this effect arises, and show that in such a setting, edges are sampled in a highly biased manner. This insight allows us to formulate tests for determining when sampling bias is present. When we apply these tests to a number of well-known datasets, we find strong evidence for sampling bias.
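
A hedged reproduction of the qualitative effect using networkx: union the shortest paths from a few random sources to many random destinations in a sparse Erdos-Renyi graph and compare degree statistics. The graph size and source/destination counts are arbitrary.

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(8)

# Sparse Erdos-Renyi graph: Poisson-like degrees, no power-law tail.
n, avg_deg = 20_000, 6
G = nx.gnp_random_graph(n, avg_deg / n, seed=8)
G = G.subgraph(max(nx.connected_components(G), key=len)).copy()

nodes = list(G.nodes)
sources = rng.choice(nodes, size=5, replace=False)
targets = rng.choice(nodes, size=2000, replace=False)

# "Traceroute" measurement: union of one shortest path per (source, target) pair.
sampled = nx.Graph()
for s in sources:
    paths = nx.single_source_shortest_path(G, s)
    for t in targets:
        if t in paths:
            nx.add_path(sampled, paths[t])

true_deg = np.array([d for _, d in G.degree()])
obs_deg = np.array([d for _, d in sampled.degree()])
print("underlying graph: mean degree %.2f, share with degree 1: %.3f"
      % (true_deg.mean(), (true_deg == 1).mean()))
print("sampled subgraph: mean degree %.2f, share with degree 1: %.3f"
      % (obs_deg.mean(), (obs_deg == 1).mean()))
# The sampled degree distribution is far more skewed than the Poisson original.
```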

Journal ArticleDOI
TL;DR: A novel web tool for the statistical analysis of gene expression data in multiple tag sampling experiments, using six different test statistics to detect differentially expressed genes.
Abstract: Here we present a novel web tool for the statistical analysis of gene expression data in multiple tag sampling experiments. Differentially expressed genes are detected by using six different test statistics.

Journal ArticleDOI
TL;DR: These devices are part of an emerging strategy for monitoring exposure to hydrophobic organic chemicals.
Abstract: These devices are part of an emerging strategy for monitoring exposure to hydrophobic organic chemicals.

Journal ArticleDOI
TL;DR: The conditions under which importance sampling is applicable in high dimensions are investigated and it is found that importance sampling densities using design points are applicable if the covariance matrix associated with each design point does not deviate significantly from the identity matrix.
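
A minimal importance-sampling sketch with the sampling density centred at the design point and identity covariance, on a linear limit-state in standard normal space where the exact failure probability is known; the dimension and reliability index are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Failure probability P[g(X) <= 0] for standard normal X in d dimensions,
# with a linear limit-state g(x) = beta - x_1 (exact answer: Phi(-beta)).
d, beta, n = 10, 3.5, 20_000
g = lambda x: beta - x[:, 0]
design_point = np.zeros(d)
design_point[0] = beta                       # closest point of the failure domain to the origin

f = multivariate_normal(mean=np.zeros(d))    # nominal standard normal density
h = multivariate_normal(mean=design_point)   # sampling density at the design point,
                                             # identity covariance (per the condition above)
x = h.rvs(size=n, random_state=9)
weights = f.pdf(x) / h.pdf(x)                # importance weights
p_hat = np.mean((g(x) <= 0) * weights)
print("importance sampling estimate:", p_hat, " exact:", norm.cdf(-beta))
```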

Proceedings ArticleDOI
25 Aug 2003
TL;DR: This paper provides methods that use flow statistics formed from a sampled packet stream to infer the frequencies of the number of packets per flow in the unsampled stream, exploiting protocol-level detail reported in flow records.
Abstract: Passive traffic measurement increasingly employs sampling at the packet level. Many high-end routers form flow statistics from a sampled substream of packets. Sampling is necessary in order to control the consumption of resources by the measurement operations. However, knowledge of the statistics of flows in the unsampled stream remains useful, for understanding both characteristics of source traffic, and consumption of resources in the network. This paper provides methods that use flow statistics formed from a sampled packet stream to infer the absolute frequencies of lengths of flows in the unsampled stream. A key part of our work is inferring the numbers and lengths of flows of original traffic that evaded sampling altogether. We achieve this through statistical inference, and by exploiting protocol level detail reported in flow records. The method has applications to detection and characterization of network attacks: we show how to estimate, from sampled flow statistics, the number of compromised hosts that are sending attack traffic past the measurement point. We also investigate the impact on our results of different implementations of packet sampling.
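
A hedged toy illustrating the inversion problem (not the paper's estimators): under independent 1-in-N packet sampling, many small flows evade sampling entirely, while per-flow SYN counts give a simple unbiased handle on the number of TCP flows. The flow-size distribution and sampling rate are invented.

```python
import numpy as np

rng = np.random.default_rng(10)

# Simulate original flows: heavy-tailed packet counts, exactly one SYN per flow (TCP-like).
n_flows, p = 100_000, 1 / 30                     # 1-in-30 independent packet sampling
flow_pkts = rng.zipf(2.2, n_flows)               # original packets per flow

syn_seen = rng.random(n_flows) < p               # was the flow's single SYN packet sampled?
other_seen = rng.binomial(flow_pkts - 1, p)      # sampled non-SYN packets
sampled_pkts = other_seen + syn_seen             # packets of each flow that survive sampling

print("true flows:                    ", n_flows)
print("flows visible after sampling:  ", int(np.sum(sampled_pkts > 0)))   # many flows evade sampling
print("total packets, scaled by 1/p:  ", int(sampled_pkts.sum() / p), "(true:", int(flow_pkts.sum()), ")")
print("flows estimated from SYNs / p: ", int(syn_seen.sum() / p))         # simple unbiased flow count
```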

Journal ArticleDOI
TL;DR: In this article, the authors evaluated the effect of data variability and the strength of spatial correlation in the data on the performance of grid soil sampling of different sampling density and two interpolation procedures, ordinary point kriging and optimal inverse distance weighting (IDW).
Abstract: Effectiveness of precision agriculture depends on accurate and efficient mapping of soil properties. Among the factors that most affect soil property mapping are the number of soil samples, the distance between sampling locations, and the choice of interpolation procedures. The objective of this study is to evaluate the effect of data variability and the strength of spatial correlation in the data on the performance of (i) grid soil sampling of different sampling density and (ii) two interpolation procedures, ordinary point kriging and optimal inverse distance weighting (IDW). Soil properties with coefficients of variation (CV) ranging from 12 to 67% were sampled in a 20-ha field using a regular grid with a 30-m distance between grid points. Data sets with different spatial structures were simulated based on the soil sample data using a simulated annealing procedure. The strength of simulated spatial structures ranged from weak with nugget to sill (N/S) ratio of 0.6 to strong (N/S ratio of 0.1). The results indicated that regardless of CV values, soil properties with a strong spatial structure were mapped more accurately than those that had weak spatial structure. Kriging with known variogram parameters performed significantly better than the IDW for most of the studied cases (P < 0.01). However, when variogram parameters were determined from sample variograms, kriging was as accurate as the IDW only for sufficiently large data sets, but was less precise when a reliable sample variogram could not be obtained from the data.
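
A minimal inverse distance weighting interpolator as a sketch of the second procedure; the grid spacing, the simulated soil property, and the power parameter are hypothetical, and the study additionally optimises the IDW power and compares against kriging.

```python
import numpy as np

def idw(xy_known, z_known, xy_query, power=2.0):
    """Inverse distance weighted interpolation (minimal sketch)."""
    d = np.linalg.norm(xy_query[:, None, :] - xy_known[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)                  # avoid division by zero at sample points
    w = 1.0 / d ** power
    return (w @ z_known) / w.sum(axis=1)

# Hypothetical 30 m grid soil samples (x, y in metres; z = measured property).
rng = np.random.default_rng(11)
xy = np.array([(x, y) for x in range(0, 450, 30) for y in range(0, 450, 30)], float)
z = np.sin(xy[:, 0] / 120) + 0.5 * np.cos(xy[:, 1] / 90) + rng.normal(0, 0.1, len(xy))

query = np.array([[100.0, 200.0], [310.0, 45.0]])
print(idw(xy, z, query))
```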

Book ChapterDOI
14 Apr 2003
TL;DR: A straightforward active learning heuristic, representative sampling, is described, which explores the clustering structure of 'uncertain' documents and identifies the representative samples to query the user opinions, for the purpose of speeding up the convergence of Support Vector Machine (SVM) classifiers.
Abstract: In order to reduce human efforts, there has been increasing interest in applying active learning for training text classifiers. This paper describes a straightforward active learning heuristic, representative sampling, which explores the clustering structure of 'uncertain' documents and identifies the representative samples to query the user opinions, for the purpose of speeding up the convergence of Support Vector Machine (SVM) classifiers. Compared with other active learning algorithms, the proposed representative sampling explicitly addresses the problem of selecting more than one unlabeled document. In an empirical study we compared representative sampling both with random sampling and with SVM active learning. The results demonstrated that representative sampling offers excellent learning performance with fewer labeled documents and thus can reduce human efforts in text classification tasks.
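
A hedged sketch of the representative-sampling idea using scikit-learn: rank pool documents by distance to the SVM boundary, cluster the most uncertain ones, and query the items nearest each cluster centre. The synthetic data and batch sizes are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

rng = np.random.default_rng(12)

# Stand-in for a document collection: small labelled seed set + unlabelled pool.
X, y = make_classification(n_samples=2000, n_features=50, random_state=0)
labelled = rng.choice(len(X), size=20, replace=False)
pool = np.setdiff1d(np.arange(len(X)), labelled)

clf = SVC(kernel="linear").fit(X[labelled], y[labelled])

# 1. 'Uncertain' documents: those closest to the SVM decision boundary.
margin = np.abs(clf.decision_function(X[pool]))
uncertain = pool[np.argsort(margin)[:200]]

# 2. Representative sampling: cluster the uncertain documents and query the
#    documents nearest each cluster centre (a batch of 10 queries).
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X[uncertain])
closest, _ = pairwise_distances_argmin_min(km.cluster_centers_, X[uncertain])
queries = uncertain[closest]
print("documents to send to the annotator:", queries)
```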

Journal ArticleDOI
TL;DR: The Continuous Plankton Recorder (CPR) has been deployed for 70 years and has been used to sample plankton in the surface waters of the European continental shelf, as mentioned in this paper.

Book
03 Nov 2003
TL;DR: This book presents the theory and applications of ranked set sampling, covering balanced and unbalanced designs, nonparametric and parametric inference, optimal designs, distribution-free tests, sampling with concomitant variables, data reduction, and case studies.
Abstract: 1 Introduction.- 2 Balanced Ranked Set Sampling I: Nonparametric.- 3 Balanced Ranked Set Sampling II: Parametric.- 4 Unbalanced Ranked Set Sampling and Optimal Designs.- 5 Distribution-Free Tests with Ranked Set Sampling.- 6 Ranked Set Sampling with Concomitant Variables.- 7 Ranked Set Sampling as Data Reduction Tools.- 8 Case Studies.- References.
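
A small simulation of balanced ranked set sampling under perfect ranking, showing the variance reduction of the RSS sample mean relative to simple random sampling for the same number of measured units; the population and set size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(13)

def balanced_rss(draw, set_size, cycles):
    """Balanced ranked set sampling with perfect ranking: in each cycle, draw
    set_size sets of set_size units, rank each set, and measure the r-th order
    statistic of the r-th set."""
    out = []
    for _ in range(cycles):
        for r in range(set_size):
            out.append(np.sort(draw(set_size))[r])
    return np.array(out)

draw = lambda m: rng.lognormal(0.0, 1.0, m)      # skewed hypothetical population
n_measured, reps = 300, 500

srs_means = [draw(n_measured).mean() for _ in range(reps)]
rss_means = [balanced_rss(draw, set_size=5, cycles=n_measured // 5).mean() for _ in range(reps)]

print("SRS mean-estimator std:", np.std(srs_means))
print("RSS mean-estimator std:", np.std(rss_means))   # typically clearly smaller for the same n
```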


Journal ArticleDOI
TL;DR: In this paper, a general framework for sampling and reconstruction procedures based on a consistency requirement was introduced, which allows for almost arbitrary sampling and reconstruction spaces, as well as arbitrary input signals.
Abstract: This article introduces a general framework for sampling and reconstruction procedures based on a consistency requirement, introduced by Unser and Aldroubi in [29]. The procedures we develop allow for almost arbitrary sampling and reconstruction spaces, as well as arbitrary input signals. We first derive a nonredundant sampling procedure. We then introduce the new concept of oblique dual frame vectors, that lead to frame expansions in which the analysis and synthesis frame vectors are not constrained to lie in the same space. Based on this notion, we develop a redundant sampling procedure that can be used to reduce the quantization error when quantizing the measurements prior to reconstruction.
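
A minimal numerical sketch of consistent (oblique) reconstruction with generic finite-dimensional sampling and reconstruction subspaces, assuming S^T W is invertible; the paper works with shift-invariant spaces and frames rather than these random subspaces.

```python
import numpy as np

rng = np.random.default_rng(14)

# Generic finite-dimensional stand-ins for the sampling space S and the
# reconstruction space W (columns span each subspace of R^8).
n, k = 8, 3
S = rng.normal(size=(n, k))       # sampling vectors
W = rng.normal(size=(n, k))       # reconstruction vectors

x = rng.normal(size=n)            # arbitrary input signal
c = S.T @ x                       # measurements <s_i, x>

# Consistent reconstruction x_hat = W (S^T W)^{-1} c: an oblique projection onto
# range(W) that reproduces the original measurements exactly.
x_hat = W @ np.linalg.solve(S.T @ W, c)
print("consistency error:", np.linalg.norm(S.T @ x_hat - c))

# Signals already in the reconstruction space are recovered perfectly.
w = W @ rng.normal(size=k)
w_hat = W @ np.linalg.solve(S.T @ W, S.T @ w)
print("error for a signal in range(W):", np.linalg.norm(w_hat - w))
```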

Proceedings ArticleDOI
01 Jul 2003
TL;DR: This work introduces structured importance sampling, a new technique for efficiently rendering scenes illuminated by distant natural illumination given in an environment map, and presents a novel hierarchical stratification algorithm that uses the authors' metric to automatically stratify the environment map into regular strata.
Abstract: We introduce structured importance sampling, a new technique for efficiently rendering scenes illuminated by distant natural illumination given in an environment map. Our method handles occlusion, high-frequency lighting, and is significantly faster than alternative methods based on Monte Carlo sampling. We achieve this speedup as a result of several ideas. First, we present a new metric for stratifying and sampling an environment map taking into account both the illumination intensity as well as the expected variance due to occlusion within the scene. We then present a novel hierarchical stratification algorithm that uses our metric to automatically stratify the environment map into regular strata. This approach enables a number of rendering optimizations, such as pre-integrating the illumination within each stratum to eliminate noise at the cost of adding bias, and sorting the strata to reduce the number of sample rays. We have rendered several scenes illuminated by natural lighting, and our results indicate that structured importance sampling is better than the best previous Monte Carlo techniques, requiring one to two orders of magnitude fewer samples for the same image quality.
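
A hedged sketch of luminance-proportional sampling of a lat-long environment map (weighting by sin θ for solid angle); the paper's metric additionally accounts for expected occlusion variance and uses hierarchical stratification rather than independent draws.

```python
import numpy as np

rng = np.random.default_rng(15)

# A hypothetical lat-long environment map: mostly dim, one small bright source.
H, W = 64, 128
env = rng.uniform(0.0, 0.2, (H, W))
env[10:12, 40:44] = 50.0

# Weight each pixel by luminance times the solid angle it subtends
# (proportional to sin(theta) in a lat-long parametrisation), then sample
# pixel indices from the resulting probability mass function.
theta = (np.arange(H) + 0.5) / H * np.pi
pmf = (env * np.sin(theta)[:, None]).ravel()
pmf /= pmf.sum()

idx = rng.choice(env.size, size=16, p=pmf)
rows, cols = np.unravel_index(idx, env.shape)
print(list(zip(rows.tolist(), cols.tolist())))   # most samples land on the bright region
# An unbiased lighting estimate divides each sample's contribution by pmf[idx].
```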

Journal ArticleDOI
TL;DR: In this article, a framework for determining a sampling approach in international studies is proposed, based on an assessment of the way in which sampling affects the validity of research results, and shows how different research objectives impact upon the desired sampling method and the desired sample characteristics.
Abstract: Sampling in the international environment needs to satisfy the same requirements as sampling in the domestic environment, but there are additional issues to consider, such as the need to balance within-country representativeness with cross-national comparability. However, most international marketing research studies fail to provide theoretical justification for their choice of sampling approach. This is because research design theory and sampling theory have not been well integrated in the context of international research. This paper seeks to fill the gap by developing a framework for determining a sampling approach in international studies. The framework is based on an assessment of the way in which sampling affects the validity of research results, and shows how different research objectives impact upon (a) the desired sampling method and (b) the desired sample characteristics. The aim is to provide researchers with operational guidance in choosing a sampling approach that is theoretically appropriate to their particular research aims.

Journal ArticleDOI
TL;DR: In this paper, two general classes of density estimation models have been developed: models that use data sets from capture-recapture or removal sampling techniques (often derived from trapping grids) from which separate estimates of population size (N) and effective sampling area (Â) are used to calculate density (D = N/Â), and models applicable to sampling regimes using distance-sampling theory (typically transect lines or trapping webs) to estimate detection functions and densities directly from the distance data.
Abstract: Statistical models for estimating absolute densities of field populations of animals have been widely used over the last century in both scientific studies and wildlife management programs. To date, two general classes of density estimation models have been developed: models that use data sets from capture–recapture or removal sampling techniques (often derived from trapping grids) from which separate estimates of population size (N) and effective sampling area (Â) are used to calculate density (D = N/Â); and models applicable to sampling regimes using distance-sampling theory (typically transect lines or trapping webs) to estimate detection functions and densities directly from the distance data. However, few studies have evaluated these respective models for accuracy, precision, and bias on known field populations, and no studies have been conducted that compare the two approaches under controlled field conditions. In this study, we evaluated both classes of density estimators on known densities of e...
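
A worked toy of the first class of estimators (D = N/Â), using Chapman's capture-recapture estimator and an effective area obtained by buffering the grid with a boundary strip of half the mean maximum distance moved; all numbers are invented, and real studies vary in how Â is defined.

```python
# Capture-recapture on a trapping grid (illustrative numbers only).
n1, n2, m2 = 60, 55, 25                          # first captures, second captures, recaptures
N_hat = (n1 + 1) * (n2 + 1) / (m2 + 1) - 1       # Chapman's bias-corrected Lincoln-Petersen

# Effective sampling area: buffer the square grid by a strip of width MMDM/2.
grid_side = 0.9                                  # km
mmdm = 0.24                                      # km, mean maximum distance moved
A_hat = (grid_side + mmdm) ** 2                  # km^2: side extended by MMDM/2 on each side

D_hat = N_hat / A_hat
print(f"N_hat = {N_hat:.1f}, A_hat = {A_hat:.2f} km^2, D = {D_hat:.1f} animals per km^2")
```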