
Showing papers on "Sampling (statistics)" published in 2004


Journal ArticleDOI
TL;DR: A linear dynamic range over 2 orders of magnitude is demonstrated using the number of tandem mass spectra (spectral sampling) acquired for each protein during data-dependent acquisition of peptides eluting into the mass spectrometer.
Abstract: Proteomic analysis of complex protein mixtures using proteolytic digestion and liquid chromatography in combination with tandem mass spectrometry is a standard approach in biological studies. Data-dependent acquisition is used to automatically acquire tandem mass spectra of peptides eluting into the mass spectrometer. In more complicated mixtures, for example, whole cell lysates, data-dependent acquisition incompletely samples among the peptide ions present rather than acquiring tandem mass spectra for all ions available. We analyzed the sampling process and developed a statistical model to accurately predict the level of sampling expected for mixtures of a specific complexity. The model also predicts how many analyses are required for saturated sampling of a complex protein mixture. For a yeast soluble cell lysate, 10 analyses are required to reach a 95% saturation level of protein identifications based on our model. The statistical model also suggests a relationship between the level of sampling observed for a protein and the relative abundance of the protein in the mixture. We demonstrate a linear dynamic range over 2 orders of magnitude by using the number of spectra (spectral sampling) acquired for each protein.
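The saturation figures quoted above follow the familiar repeated-sampling arithmetic: if a protein is sampled with some probability in a single run, its chance of being seen at least once grows geometrically with the number of runs. A minimal sketch of that calculation (the constant per-run probability and the function names are illustrative assumptions, not the paper's actual model):

```python
import math

def detection_prob(p_run: float, n_runs: int) -> float:
    # Probability a protein is sampled at least once in n_runs,
    # assuming (for illustration) a constant per-run probability p_run.
    return 1.0 - (1.0 - p_run) ** n_runs

def runs_for_saturation(p_run: float, target: float = 0.95) -> int:
    # Smallest number of analyses whose cumulative detection
    # probability reaches the target saturation level.
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_run))

# A protein sampled in ~26% of single runs needs about 10 analyses to be
# detected with 95% probability, consistent with the order of magnitude
# quoted in the abstract.
print(detection_prob(0.26, 10))   # ~0.951
print(runs_for_saturation(0.26))  # 10
```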

2,506 citations


Journal ArticleDOI
TL;DR: Software for analysing complex survey samples in R is presented; the sampling scheme can be explicitly described or represented by replication weights, and variance estimation uses either replication or linearisation.
Abstract: I present software for analysing complex survey samples in R. The sampling scheme can be explicitly described or represented by replication weights. Variance estimation uses either replication or linearisation.
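The software described is the R survey package; its two variance strategies can be illustrated outside R. Below is a minimal Python sketch of a design-weighted mean with a delete-one jackknife, a toy stand-in for replication-weight variance estimation (invented data; real complex designs add strata, clusters, and calibration):

```python
import numpy as np

def weighted_mean(y, w):
    # Hajek (design-weighted) mean of y under sampling weights w.
    return np.sum(w * y) / np.sum(w)

def jackknife_variance(y, w):
    # Delete-one jackknife variance of the weighted mean: a toy
    # stand-in for replication-weight variance estimation.
    n = len(y)
    reps = np.array([weighted_mean(np.delete(y, i), np.delete(w, i))
                     for i in range(n)])
    return (n - 1) / n * np.sum((reps - reps.mean()) ** 2)

rng = np.random.default_rng(0)
y = rng.normal(50, 10, size=200)
w = rng.uniform(1, 5, size=200)        # unequal sampling weights
print(weighted_mean(y, w), jackknife_variance(y, w))
```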

1,786 citations


Journal ArticleDOI
TL;DR: This paper develops a sampling and estimation technique called respondent-driven sampling, which allows researchers to make asymptotically unbiased estimates about the characteristics of hidden populations such as injection drug users, the homeless, and artists.
Abstract: Standard statistical methods often provide no way to make accurate estimates about the characteristics of hidden populations such as injection drug users, the homeless, and artists. In this paper, we further develop a sampling and estimation technique called respondent-driven sampling, which allows researchers to make asymptotically unbiased estimates about these hidden populations. The sample is selected with a snowball-type design that can be done more cheaply, quickly, and easily than other methods currently in use. Further, we can show that under certain specified (and quite general) conditions, our estimates for the percentage of the population with a specific trait are asymptotically unbiased. We further show that these estimates are asymptotically unbiased no matter how the seeds are selected. We conclude with a comparison of respondent-driven samples of jazz musicians in New York and San Francisco, with corresponding institutional samples of jazz musicians from these cities. The results show that ...
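A core ingredient of respondent-driven sampling estimators is down-weighting respondents in proportion to their network degree, since well-connected people are more likely to be recruited. A simplified sketch of a degree-weighted proportion estimate (invented data; the paper's estimator also exploits the structure of the recruitment chains):

```python
import numpy as np

def rds_proportion(trait, degree):
    # Degree-weighted (RDS-style) estimate of the share of a hidden
    # population carrying a trait: each respondent is weighted by
    # 1/degree, since high-degree people are more likely to be recruited.
    trait = np.asarray(trait, dtype=float)
    inv_deg = 1.0 / np.asarray(degree, dtype=float)
    return np.sum(inv_deg * trait) / np.sum(inv_deg)

trait  = [1, 0, 1, 1, 0, 0, 1, 0]        # 1 = has the trait of interest
degree = [20, 5, 8, 40, 10, 4, 25, 6]    # reported personal network sizes
print(rds_proportion(trait, degree))     # ~0.25 vs a raw proportion of 0.50
```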

1,744 citations


Journal ArticleDOI
TL;DR: Simulated data sets were used to test the power and accuracy of Monte Carlo resampling methods in generating statistical thresholds for identifying F0 immigrants in populations with ongoing gene flow, and hence for providing direct, real‐time estimates of migration rates.
Abstract: Genetic assignment methods use genotype likelihoods to draw inference about where individuals were or were not born, potentially allowing direct, real-time estimates of dispersal. We used simulated data sets to test the power and accuracy of Monte Carlo resampling methods in generating statistical thresholds for identifying F0 immigrants in populations with ongoing gene flow, and hence for providing direct, real-time estimates of migration rates. The identification of accurate critical values required that resampling methods preserved the linkage disequilibrium deriving from recent generations of immigrants and reflected the sampling variance present in the data set being analysed. A novel Monte Carlo resampling method taking into account these aspects was proposed and its efficiency was evaluated. Power and error were relatively insensitive to the frequency assumed for missing alleles. Power to identify F0 immigrants was improved by using large sample size (up to about 50 individuals) and by sampling all populations from which migrants may have originated. A combination of plotting genotype likelihoods and calculating mean genotype likelihood ratios (DLR) appeared to be an effective way to predict whether F0 immigrants could be identified for a particular pair of populations using a given set of markers.

1,481 citations


Journal ArticleDOI
TL;DR: In this paper, a unified strategy for selecting spatially balanced probability samples of natural resources is presented, which is based on creating a function that maps two-dimensional space into onedimensional space, thereby defining an ordered spatial address.
Abstract: The spatial distribution of a natural resource is an important consideration in designing an efficient survey or monitoring program for the resource. Generally, sample sites that are spatially balanced, that is, more or less evenly dispersed over the extent of the resource, are more efficient than simple random sampling. We review a unified strategy for selecting spatially balanced probability samples of natural resources. The technique is based on creating a function that maps two-dimensional space into one-dimensional space, thereby defining an ordered spatial address. We use a restricted randomization to randomly order the addresses, so that systematic sampling along the randomly ordered linear structure results in a spatially well-balanced random sample. Variable inclusion probability, proportional to an arbitrary positive ancillary variable, is easily accommodated. The basic technique selects points in a two-dimensional continuum, but is also applicable to sampling finite populations or one-dimension...
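The address-based construction can be caricatured in a few lines: recursively assign each point a base-4 quadrant address, randomize the digits, sort, and sample systematically along the resulting order. This sketch simplifies the published design (which randomizes each branch of the hierarchy independently and supports unequal inclusion probabilities); all names and parameters are illustrative:

```python
import numpy as np

def randomized_address(x, y, perms):
    # Base-4 spatial address in the unit square: at each level the square
    # is split into quadrants and the quadrant digit is permuted.
    addr = 0
    for perm in perms:                  # one digit permutation per level
        x, y = x * 2.0, y * 2.0
        qx, qy = int(x), int(y)
        x, y = x - qx, y - qy
        addr = addr * 4 + perm[2 * qy + qx]
    return addr

def spatially_balanced_sample(points, n, levels=8, seed=0):
    rng = np.random.default_rng(seed)
    perms = [rng.permutation(4) for _ in range(levels)]
    addresses = [randomized_address(px, py, perms) for px, py in points]
    order = np.argsort(addresses)
    step = len(points) / n              # systematic sample along the 1-D order
    start = rng.uniform(0, step)
    return [order[int(start + k * step)] for k in range(n)]

pts = np.random.default_rng(1).uniform(size=(1000, 2))
print(spatially_balanced_sample(pts, 10))   # indices of 10 well-spread sites
```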

1,082 citations


Journal ArticleDOI
TL;DR: The results challenge the recently proposed notion that a set of six icosahedrally-arranged orientations is optimal for DT-MRI and show that at least 20 unique sampling orientations are necessary for a robust estimation of anisotropy, whereas at least 30 unique sampling orientations are required for a robust estimation of tensor-orientation and mean diffusivity.
Abstract: There are conflicting opinions in the literature as to whether it is more beneficial to use a large number of gradient sampling orientations in diffusion tensor MRI (DT-MRI) experiments than to use a smaller number of carefully chosen orientations. In this study, Monte Carlo simulations were used to study the effect of using different gradient sampling schemes on estimates of tensor-derived quantities assuming a b-value of 1000 s mm⁻². The study focused in particular on the effect that the number of unique gradient orientations has on uncertainty in estimates of tensor-orientation, and on estimates of the trace and anisotropy of the diffusion tensor. The results challenge the recently proposed notion that a set of six icosahedrally-arranged orientations is optimal for DT-MRI. It is shown that at least 20 unique sampling orientations are necessary for a robust estimation of anisotropy, whereas at least 30 unique sampling orientations are required for a robust estimation of tensor-orientation and mean diffusivity. Finally, the performance of sampling schemes that use low numbers of sampling orientations, but make efficient use of available gradient power, are compared to less efficient schemes with larger numbers of sampling orientations, and the relevant scenarios in which each type of scheme should be used are discussed. Magn Reson Med 51:807–815, 2004.

824 citations


Journal ArticleDOI
TL;DR: It is suggested that for many sampling situations, relationships between probability of detection and habitat covariates need to be established to correctly interpret results of wildlife–habitat models.

749 citations


Journal ArticleDOI
TL;DR: In this paper, the authors review the application and interpretation of logistic regression under three sampling designs (random, case-control, and use-availability); for habitat use-nonuse studies employing random sampling, it can be used to directly model the conditional probability of use.
Abstract: Logistic regression is an important tool for wildlife habitat-selection studies, but the method frequently has been misapplied due to an inadequate understanding of the logistic model, its interpretation, and the influence of sampling design. To promote better use of this method, we review its application and interpretation under 3 sampling designs: random, case–control, and use–availability. Logistic regression is appropriate for habitat use–nonuse studies employing random sampling and can be used to directly model the conditional probability of use in such cases. Logistic regression also is appropriate for studies employing case–control sampling designs, but careful attention is required to interpret results correctly. Unless bias can be estimated or probability of use is small for all habitats, results of case–control studies should be interpreted as odds ratios, rather than probability of use or relative probability of use. When data are gathered under a use–availability design, logistic regr...
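The case-control caveat is easy to demonstrate: under case-control sampling the slope coefficients remain valid log odds ratios, while the intercept absorbs the sampling fractions. A sketch with synthetic data (all numbers invented; uses statsmodels):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# population: P(use | x) follows a logistic model with known coefficients
N = 200_000
x = rng.normal(size=N)
p = 1.0 / (1.0 + np.exp(-(-3.0 + 1.0 * x)))    # true intercept -3, slope 1
y = rng.binomial(1, p)

# case-control sample: all "used" sites plus an equal number of "unused" ones
cases = np.flatnonzero(y == 1)
controls = rng.choice(np.flatnonzero(y == 0), size=len(cases), replace=False)
idx = np.concatenate([cases, controls])

fit = sm.Logit(y[idx], sm.add_constant(x[idx])).fit(disp=0)
print(fit.params)   # slope is ~1, so the odds ratio exp(1) is recovered;
                    # the intercept is shifted by the sampling fractions
```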

568 citations



Journal ArticleDOI
TL;DR: In this article, the authors examined the influence of range size on the sample size and sampling prevalence of data used to train and test distribution models for 32 bird species endemic to South Africa, Lesotho and Swaziland.
Abstract: Summary 1 Conservation scientists and resource managers increasingly employ empirical distribution models to aid decision-making. However, such models are not equally reliable for all species, and range size can affect their performance. We examined to what extent this effect reflects statistical artefacts arising from the influence of range size on the sample size and sampling prevalence (proportion of samples representing species presence) of data used to train and test models. 2 Our analyses used both simulated data and empirical distribution models for 32 bird species endemic to South Africa, Lesotho and Swaziland. Models were built with either logistic regression or non-linear discriminant analysis, and assessed with four measures of model accuracy: sensitivity, specificity, Cohen's kappa and the area under the curve (AUC) of receiver-operating characteristic (ROC) plots. Environmental indices derived from Fourier-processed satellite imagery served as predictors. 3 We first followed conventional modelling practice to illustrate how range size might influence model performance, when sampling prevalence reflects species’ natural prevalences. We then demonstrated that this influence is primarily artefactual. Statistical artefacts can arise during model assessment, because Cohen's kappa responds systematically to changes in prevalence. AUC, in contrast, is largely unaffected, and thus a more reliable measure of model performance. Statistical artefacts also arise during model fitting. Both logistic regression and discriminant analysis are sensitive to the sample size and sampling prevalence of training data. Both perform best when sample size is large and prevalence intermediate. 4 Synthesis and applications. Species’ ecological characteristics may influence the performance of distribution models. Statistical artefacts, however, can confound results in comparative studies seeking to identify these characteristics. To mitigate artefactual effects, we recommend careful reporting of sampling prevalence, AUC as the measure of accuracy, and fixed, intermediate levels of sampling prevalence in comparative studies.
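The kappa-versus-AUC artefact can be checked directly by holding the score distributions fixed and varying only prevalence. A sketch with synthetic scores (parameters invented; uses scikit-learn's metric functions):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, roc_auc_score

rng = np.random.default_rng(0)

def simulate(prevalence, n=20_000):
    # Presence/absence labels plus model scores drawn from two fixed,
    # overlapping distributions; only the class mix changes.
    n_pos = int(n * prevalence)
    y = np.r_[np.ones(n_pos), np.zeros(n - n_pos)].astype(int)
    s = np.r_[rng.normal(1.0, 1.0, n_pos), rng.normal(0.0, 1.0, n - n_pos)]
    return y, s

for prev in (0.05, 0.25, 0.50):
    y, s = simulate(prev)
    auc = roc_auc_score(y, s)
    kappa = cohen_kappa_score(y, (s > 0.5).astype(int))  # fixed threshold
    print(f"prevalence={prev:.2f}  AUC={auc:.3f}  kappa={kappa:.3f}")
# AUC stays near 0.76 at every prevalence, while kappa drifts with the
# class mix: the artefact the abstract warns about.
```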

521 citations


Posted Content
TL;DR: This work introduces Mixed Data Sampling (MIDAS) regression models, which involve time series data sampled at different frequencies and have wide applicability in macroeconomics and finance.
Abstract: We introduce Mixed Data Sampling (henceforth MIDAS) regression models. The regressions involve time series data sampled at different frequencies. Technically speaking, MIDAS models specify conditional expectations as a distributed lag of regressors recorded at some higher sampling frequencies. We examine the asymptotic properties of MIDAS regression estimation and compare it with traditional distributed lag models. MIDAS regressions have wide applicability in macroeconomics and finance.
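A minimal MIDAS regression needs only two pieces: a parsimonious weight function over the high-frequency lags and a nonlinear least-squares fit. The sketch below uses exponential Almon weights, one common choice; the data, lag length, and parameter values are invented for illustration:

```python
import numpy as np
from scipy.optimize import minimize

def exp_almon(theta1, theta2, K):
    # Exponential Almon lag weights: a low-dimensional way to let a
    # low-frequency regression load on K high-frequency lags.
    k = np.arange(1, K + 1)
    w = np.exp(theta1 * k + theta2 * k**2)
    return w / w.sum()

def midas_sse(params, y, X_hf):
    # Sum of squared errors of y_t = b0 + b1 * sum_k w_k(theta) x_{t,k}.
    b0, b1, t1, t2 = params
    w = exp_almon(t1, t2, X_hf.shape[1])
    return np.sum((y - b0 - b1 * (X_hf @ w)) ** 2)

# toy data: a "quarterly" y driven by K = 12 "monthly" lags
rng = np.random.default_rng(0)
K, T = 12, 200
X = rng.normal(size=(T, K))
y = 0.5 + 2.0 * (X @ exp_almon(0.2, -0.05, K)) + rng.normal(scale=0.3, size=T)

res = minimize(midas_sse, x0=[0.0, 1.0, 0.0, -0.01], args=(y, X),
               method="Nelder-Mead", options={"maxiter": 5000})
print(res.x)   # roughly recovers (0.5, 2.0, 0.2, -0.05)
```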


01 Jan 2004
TL;DR: 1. Introduction to advanced distance sampling 2. General formulation for distance sampling 3. Covariate models 4. Spatial distance sampling models 5. Temporal inferences from distance sampling surveys 6. Methods for incomplete detection at distance zero.
Abstract: 1. Introduction to advanced distance sampling 2. General formulation for distance sampling 3. Covariate models 4. Spatial distance sampling models 5. Temporal inferences from distance sampling surveys 6. Methods for incomplete detection at distance zero 7. Design of distance sampling surveys and Geographic Information Systems 8. Adaptive distance sampling surveys 9. Passive approaches to detection in distance sampling 10. Assessment of distance sampling estimators 11. Further topics in distance sampling References Index

Journal ArticleDOI
TL;DR: Anthony Tuckett analyses a research experience, together with the rationales for and limitations of qualitative research sampling, and examines the reality of establishing and maintaining a purposeful/theoretical sample and how data saturation symbiotically interacts with constant comparison to guide sampling.
Abstract: In this article Anthony Tuckett discusses the complexities of qualitative research sampling. He analyses a research experience, together with the rationales for and limitations of qualitative research sampling. Further, he examines the reality of establishing and maintaining a purposeful/theoretical sample and how data saturation symbiotically interacts with constant comparison to guide sampling. Additionally, sample limitations are countered. This paper is aimed at novice and experienced researchers in nursing interested in the practical reality of research, who are also mindful of the necessity for rigour.

Journal ArticleDOI
TL;DR: In this paper, the authors applied a new method to estimate proportion of area occupied using detection/nondetection data from a terrestrial salamander system in Great Smoky Mountains National Park.
Abstract: Recent, worldwide amphibian declines have highlighted a need for more extensive and rigorous monitoring programs to document species occurrence and detect population change. Abundance estimation methods, such as mark–recapture, are often expensive and impractical for large-scale or long-term amphibian monitoring. We apply a new method to estimate proportion of area occupied using detection/nondetection data from a terrestrial salamander system in Great Smoky Mountains National Park. Estimated species-specific detection probabilities were all <1 and varied among seven species and four sampling methods. Time (i.e., sampling occasion) and four large-scale habitat characteristics (previous disturbance history, vegetation type, elevation, and stream presence) were important covariates in estimates of both proportion of area occupied and detection probability. All sampling methods were consistent in their ability to identify important covariates for each salamander species. We believe proportion of area occupie...
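The proportion-of-area-occupied approach separates the probability that a site is occupied (psi) from the probability of detecting the species on a visit given occupancy (p). A stripped-down, covariate-free sketch of the likelihood and its maximization (simulated data; the study's models add habitat and method covariates):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def neg_log_lik(params, detections, K):
    # Basic occupancy model: a site is occupied with probability psi;
    # if occupied, each of K visits detects the species with probability p.
    # (Binomial coefficients are omitted: constant in the parameters.)
    psi, p = expit(params)                 # keep both probabilities in (0, 1)
    d = np.asarray(detections)
    lik = psi * p**d * (1 - p) ** (K - d) + (1 - psi) * (d == 0)
    return -np.sum(np.log(lik))

# simulated data: 100 sites, K = 4 visits, true psi = 0.6, true p = 0.3
rng = np.random.default_rng(0)
K = 4
occupied = rng.binomial(1, 0.6, size=100)
detections = rng.binomial(K, 0.3, size=100) * occupied

fit = minimize(neg_log_lik, x0=np.zeros(2), args=(detections, K))
print(expit(fit.x))   # estimates of (psi, p); a detection-ignoring
                      # occupancy estimate would be biased low
```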

Book
01 Jan 2004
TL;DR: Sampling Rare or Elusive Species describes the latest sampling designs and survey methods for reliably estimating occupancy, abundance, and other population parameters of rare, elusive, or otherwise hard-to-detect plants and animals.
Abstract: Information regarding population status and abundance of rare species plays a key role in resource management decisions. Ideally, data should be collected using statistically sound sampling methods, but by their very nature, rare or elusive species pose a difficult sampling challenge. Sampling Rare or Elusive Species describes the latest sampling designs and survey methods for reliably estimating occupancy, abundance, and other population parameters of rare, elusive, or otherwise hard-to-detect plants and animals. It offers a mixture of theory and application, with actual examples from terrestrial, aquatic, and marine habitats around the world. Sampling Rare or Elusive Species is the first volume devoted entirely to this topic and provides natural resource professionals with a suite of innovative approaches to gathering population status and trend data. It represents an invaluable reference for natural resource professionals around the world, including fish and wildlife biologists, ecologists, biometricians, natural resource managers, and all others whose work or research involves rare or elusive species.

Journal ArticleDOI
TL;DR: In this article, the effects of bias in resource selection functions (RSF) and compared the effectiveness of two bias-correction techniques (sample weighting and iterative simulation) were investigated.
Abstract: Summary 1. Compared to traditional radio-collars, global positioning system (GPS) collars provide finer spatial resolution and collect locations across a broader range of spatial and temporal conditions. However, data from GPS collars are biased because vegetation and terrain interfere with the satellite signals necessary to acquire a location. Analyses of habitat selection generally proceed without correcting for this known sampling bias. We documented the effects of bias in resource selection functions (RSF) and compared the effectiveness of two bias-correction techniques. 2. The effects of environmental conditions on the probability of a GPS collar collecting a location were modelled for three brands of collar using data collected in 24-h trials at 194 test locations. The best-supported model was used to create GPS-biased data from unbiased animal locations. These data were used to assess the effects of bias given data losses in the range of 10-40% at both 1- and 6-h sampling intensities. We compared the sign, value and significance of coefficients derived using biased and unbiased data. 3. With 6-h locations we observed type II error rates of 30-40% given as little as a 10% data loss. Biased data also produced coefficients that were significantly more negative than unbiased estimates. Increasing the sampling intensity from 6- to 1-h locations eliminated type II errors but increased the magnitude of coefficient bias. No type I errors or changes in sign were observed. 4. We applied sample weighting and iterative simulation given a 30% data loss. For a biased vegetation type, simulation reduced more type II errors than weighting, most probably because the original sample size was re-established. However, selection for areas near trails, which was influenced by a biased vegetation type, showed fewer type II errors after weighting existing animal locations than after simulation. Both techniques corrected 100% and ≥ 80% of the biased coefficients at the 6- and 1-h sampling intensities, respectively. 5. Synthesis and applications. This study demonstrates that GPS error is predictable and biases the coefficients of resource selection models dependent upon the GPS sampling intensity and the level of data loss. We provide effective alternatives for correcting bias and discuss applying corrections under different sampling designs.
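The sample-weighting correction has a simple core: each acquired GPS fix is up-weighted by the inverse of its habitat-specific acquisition probability, as estimated from the stationary collar trials. A sketch with invented acquisition rates:

```python
import numpy as np

# hypothetical per-habitat fix-acquisition rates from stationary collar trials
p_acquire = {"open": 0.95, "conifer": 0.65}

# raw (biased) sample of fixes that the collar managed to acquire
fixes = np.array(["open"] * 60 + ["conifer"] * 40)

# weight each fix by 1 / P(acquisition | habitat), so habitats that block
# satellite signals are no longer under-represented
w = np.array([1.0 / p_acquire[h] for h in fixes])

for h in p_acquire:
    raw = np.mean(fixes == h)
    corrected = w[fixes == h].sum() / w.sum()
    print(f"{h}: raw use {raw:.2f} -> weighted use {corrected:.2f}")
# conifer rises from 0.40 to ~0.49; open falls correspondingly
```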

Book
15 Apr 2004
TL;DR: Sampling Strategies for Natural Resources and the Environment as mentioned in this paper covers the sampling techniques used in ecology, forestry, environmental science, and natural resources and presents methods to estimate aggregate characteristics on a per unit area basis as well as on an elemental basis.
Abstract: Written by renowned experts in the field, Sampling Strategies for Natural Resources and the Environment covers the sampling techniques used in ecology, forestry, environmental science, and natural resources. The book presents methods to estimate aggregate characteristics on a per unit area basis as well as on an elemental basis. In addition to comm...

Proceedings ArticleDOI
06 Jul 2004
TL;DR: This paper examines some of the implementation issues in rigid body path planning, in particular metrics, sampling, and interpolation on SE(3), and presents techniques which have been found to be effective experimentally.
Abstract: Important implementation issues in rigid body path planning are often overlooked. In particular, sampling-based motion planning algorithms typically require a distance metric defined on the configuration space, a sampling function, and a method for interpolating sampled points. The configuration space of a 3D rigid body is identified with the Lie group SE(3). Defining proper metrics, sampling, and interpolation techniques for SE(3) is not obvious, and can become a hidden source of failure for many planning algorithm implementations. This paper examines some of these issues and presents techniques which have been found to be effective experimentally for rigid body path planning.
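Two of the SE(3) ingredients discussed here, uniform sampling of the rotation component and geodesic interpolation between sampled rotations, have standard quaternion constructions. A sketch of both (this shows the canonical formulas, not the paper's specific metric comparisons):

```python
import numpy as np

def random_quaternion(rng):
    # Uniform random rotation (Shoemake's method): a unit quaternion
    # drawn uniformly from the double cover of SO(3).
    u1, u2, u3 = rng.uniform(size=3)
    a, b = np.sqrt(1 - u1), np.sqrt(u1)
    return np.array([a * np.sin(2 * np.pi * u2), a * np.cos(2 * np.pi * u2),
                     b * np.sin(2 * np.pi * u3), b * np.cos(2 * np.pi * u3)])

def slerp(q0, q1, t):
    # Spherical linear interpolation between unit quaternions:
    # constant-speed interpolation of the rotation part of SE(3).
    dot = np.dot(q0, q1)
    if dot < 0:            # take the shorter of the two great-circle arcs
        q1, dot = -q1, -dot
    if dot > 0.9995:       # nearly parallel: fall back to normalized lerp
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

rng = np.random.default_rng(0)
qa, qb = random_quaternion(rng), random_quaternion(rng)
# an SE(3) sample or interpolant pairs a quaternion with an R^3 translation
print(slerp(qa, qb, 0.5))
```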

Journal ArticleDOI
TL;DR: An implementation of umbrella sampling in which the pertinent range of states is subdivided into small windows that are sampled consecutively and linked together is considered, which is comparable to a multicanonical simulation with a very good weight function.
Abstract: We consider an implementation of umbrella sampling in which the pertinent range of states is subdivided into small windows that are sampled consecutively and linked together. This allows us to simulate without a weight function or to extrapolate the results to the neighboring window in order to estimate a weight function. Additionally, we present a detailed error analysis in which we demonstrate that the error in umbrella sampling is controlled and, in the absence of sampling difficulties, independent of the window sizes. In this case, the efficiency of our implementation is comparable to a multicanonical simulation with a very good weight function, which in our scheme does not need to be known ahead of time. The analysis also allows us to detect sampling difficulties such as correlations between adjacent windows and provides a test of equilibration. We exemplify the scheme by simulating the liquid–vapor coexistence in a Lennard-Jones system.
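The linking step can be made concrete: each window's histogram yields a free-energy segment F = -ln h that is defined only up to an additive constant, and consecutive segments are aligned on the bins they share. A toy sketch (invented histograms; the actual scheme also extrapolates weight functions between windows and carries a full error analysis):

```python
import numpy as np

def stitch_windows(windows):
    # Link consecutive umbrella-sampling windows. Each window is a dict
    # {global_bin_index: histogram_count}. A window's free-energy segment
    # is F = -ln h; each segment is shifted to agree with the running
    # profile on the bins they share (mean offset over the overlap).
    profile = {}
    for hist in windows:
        seg = {b: -np.log(c) for b, c in hist.items() if c > 0}
        if not profile:
            profile.update(seg)
            continue
        common = profile.keys() & seg.keys()   # windows must overlap
        shift = np.mean([profile[b] - seg[b] for b in common])
        for b, f in seg.items():
            profile.setdefault(b, f + shift)
    bins = sorted(profile)
    return np.array(bins), np.array([profile[b] for b in bins])

# two toy windows over bins 0..10 and 8..18 with different normalizations
w1 = {b: 100 * np.exp(-((b - 6) / 4.0) ** 2) for b in range(0, 11)}
w2 = {b: 700 * np.exp(-((b - 6) / 4.0) ** 2) for b in range(8, 19)}
bins, F = stitch_windows([w1, w2])
print(F - F.min())   # the 7x normalization mismatch is absorbed by the shift
```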

Posted Content
TL;DR: A consideration of the similarities and unique characteristics of online file sharing and software piracy.
Abstract: Considering the similarities and unique characteristics of online file sharing and software piracy.

Journal ArticleDOI
TL;DR: Non-uniform sampling is shown to provide significant time savings in the acquisition of a suite of three-dimensional NMR experiments used for obtaining backbone assignments of H, N, C', CA, and CB nuclei in proteins. This will help extend the size limit of proteins accessible to NMR studies and open the way to studies of samples that suffer from solubility problems.

Journal ArticleDOI
TL;DR: This study proposes ratio estimators by adapting the estimator type of Ray and Singh to the traditional and other ratio-type estimators for simple random sampling in the literature, and derives the conditions under which each proposed estimator is more efficient than the others.
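For context, the classical ratio estimator that such proposals modify exploits an auxiliary variable x, correlated with y, whose population mean is known. A sketch of that baseline under simple random sampling (synthetic data; this is not the Ray-and-Singh adaptation itself):

```python
import numpy as np

def ratio_estimate(y, x, X_bar):
    # Classical ratio estimator of the population mean of y under simple
    # random sampling, using an auxiliary variable x with known population
    # mean X_bar: ybar_R = (ybar / xbar) * X_bar. It beats the plain sample
    # mean when y and x are strongly, positively correlated.
    return y.mean() / x.mean() * X_bar

rng = np.random.default_rng(0)
X_pop = rng.uniform(10, 50, size=10_000)
Y_pop = 2.0 * X_pop + rng.normal(scale=4, size=10_000)  # y roughly prop. to x

idx = rng.choice(10_000, size=50, replace=False)        # simple random sample
print("plain mean:", Y_pop[idx].mean())
print("ratio est.:", ratio_estimate(Y_pop[idx], X_pop[idx], X_pop.mean()))
print("true mean :", Y_pop.mean())
```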

Proceedings ArticleDOI
26 Apr 2004
TL;DR: Back-casting operates by first having a small subset of the wireless sensors communicate their information to a fusion center, which provides an initial estimate of the environment being sensed, and guides the allocation of additional network resources.
Abstract: Wireless sensor networks provide an attractive approach to spatially monitoring environments. Wireless technology makes these systems relatively flexible, but also places heavy demands on energy consumption for communications. This raises a fundamental trade-off: using higher densities of sensors provides more measurements, higher resolution and better accuracy, but requires more communications and processing. This paper proposes a new approach, called "back-casting," which can significantly reduce communications and energy consumption while maintaining high accuracy. Back-casting operates by first having a small subset of the wireless sensors communicate their information to a fusion center. This provides an initial estimate of the environment being sensed, and guides the allocation of additional network resources. Specifically, the fusion center backcasts information based on the initial estimate to the network at large, selectively activating additional sensor nodes in order to achieve a target error level. The key idea is that the initial estimate can detect correlations in the environment, indicating that many sensors may not need to be activated by the fusion center. Thus, adaptive sampling can save energy compared to dense, non-adaptive sampling. This method is theoretically analyzed in the context of field estimation and it is shown that the energy savings can be quite significant compared to conventional approaches. For example, when sensing a piecewise smooth field with an array of 100 × 100 sensors, adaptive sampling can reduce the energy consumption by roughly a factor of 10 while providing the same accuracy achievable if all sensors were activated.
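The two-phase idea can be caricatured on a synthetic field: a coarse subset of sensors reports first, and additional nodes are activated only where the coarse estimate changes rapidly. All step sizes and thresholds below are invented:

```python
import numpy as np

def adaptive_sample(field, coarse_step=10, threshold=0.25):
    # Two-phase sketch of the back-casting idea: a sparse subset of
    # sensors reports first; the fusion center then activates extra
    # sensors only where the coarse estimate varies rapidly (e.g. near
    # boundaries of a piecewise-smooth field).
    active = np.zeros_like(field, dtype=bool)
    active[::coarse_step, ::coarse_step] = True          # phase 1: coarse grid

    coarse = field[::coarse_step, ::coarse_step]
    gy, gx = np.gradient(coarse)
    rough = np.hypot(gy, gx) > threshold                 # where detail is needed

    for i, j in zip(*np.nonzero(rough)):                 # phase 2: back-cast
        active[i * coarse_step:(i + 1) * coarse_step,
               j * coarse_step:(j + 1) * coarse_step] = True
    return active

# piecewise-smooth toy field on a 100x100 sensor grid: a single step edge
field = np.zeros((100, 100))
field[:, 50:] = 1.0
active = adaptive_sample(field)
print(f"activated {active.mean():.0%} of sensors instead of 100%")
```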

Journal ArticleDOI
TL;DR: The spatial structure of EMF community data from a number of studies carried out in seven mature forest stands and one recently fire-initiated stand indicated that in four of eight sites community similarity decreased with distance, whereas Mantel correlograms found spatial autocorrelation in those four plus two additional sites.

Journal ArticleDOI
TL;DR: In this article, synthetic error fields were imposed on an observation-based 1/2° latitude/longitude gridded precipitation data set to assess the effect of this error on simulated hydrological fluxes and states.
Abstract: Precipitation is the single most important determinant of the fluxes and states of the land surface hydrological system and the most important atmospheric input to hydrological models. Satellite-based precipitation estimates, such as those anticipated from the Global Precipitation Measurement (GPM) satellites, hold great promise for application in hydrologic simulation and prediction, especially in parts of the world where surface observation networks are sparse. However, the usefulness of these precipitation products for hydrological applications will depend on their error characteristics. Of particular interest in satellite-derived precipitation estimates is the sampling error, that is, the error in accumulated precipitation due to periodic sampling of the precipitation rate. To assess the effect of this error on simulated hydrological fluxes and states, synthetic error fields were imposed on an observation-based 1/2° latitude/longitude gridded precipitation data set. In turn, the generated precipitation fields were used as input to a macroscale hydrology model (MHM). Our results show that (1) streamflow errors were large for small drainage areas but decreased rapidly for drainage areas larger than about 50,000 km². Much of the streamflow error is associated with fast (near-surface) runoff response. (2) Streamflow estimates were biased upward due to sampling errors, with the bias increasing with sampling interval and with drainage area. Evapotranspiration was biased downward in a compensating amount. (3) Spatial correlation of precipitation errors reduced the rate at which errors decreased with drainage area for all variables investigated, but the differences between the correlated and uncorrelated error cases were smaller for streamflow and evapotranspiration than for precipitation.

Journal ArticleDOI
TL;DR: In this paper, the authors summarized the most recent literature on the best practices of Web survey implementation and offered practical advice for researchers to implement web survey implementation, and summarized the survey best practices.
Abstract: This chapter summarizes the most recent literature on the best practices of Web survey implementation and offers practical advice for researchers.

Proceedings ArticleDOI
10 Mar 2004
TL;DR: This paper presents StatCache, a novel sampling-based method for performing data-locality analysis on realistic workloads, based on a probabilistic model of the cache, rather than a functional cache simulator.
Abstract: The widening memory gap reduces performance of applications with poor data locality. Therefore, there is a need for methods to analyze data locality and help application optimization. In this paper we present StatCache, a novel sampling-based method for performing data-locality analysis on realistic workloads. StatCache is based on a probabilistic model of the cache, rather than a functional cache simulator. It uses statistics from a single run to accurately estimate miss ratios of fully-associative caches of arbitrary sizes and generate working-set graphs. We evaluate StatCache using the SPEC CPU2000 benchmarks and show that StatCache gives accurate results with a sampling rate as low as 10⁻⁴. We also provide a proof-of-concept implementation, and discuss potentially very fast implementation alternatives.
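The flavour of a probabilistic cache model can be sketched as follows: under random replacement, an access whose reuse distance is d (accesses since its line was last touched) misses if one of the roughly d·m intervening misses evicted the line, which yields a fixed-point equation in the miss ratio m. The sketch below follows that general idea; the workload and constants are invented, and StatCache's actual model and implementation differ in detail:

```python
import numpy as np

def miss_ratio_from_reuse(reuse_dists, cache_lines, iters=100):
    # Probabilistic model: for a fully-associative cache of L lines with
    # random replacement, an access with reuse distance d survives the
    # ~d*m intervening misses with probability (1 - 1/L)^(d*m).
    # Solve the fixed point  m = mean(1 - (1 - 1/L)^(d * m))  over a
    # *sample* of reuse distances, with no functional simulation.
    d = np.asarray(reuse_dists, dtype=float)
    m = 1.0                                    # start from the pessimistic end
    for _ in range(iters):
        m = np.mean(1.0 - (1.0 - 1.0 / cache_lines) ** (d * m))
    return m

# toy reuse-distance sample (e.g. one in 10^4 accesses instrumented)
rng = np.random.default_rng(0)
sample = rng.geometric(1 / 2000, size=5000)    # hypothetical workload
for lines in (512, 2048, 8192):
    print(lines, "lines ->", round(miss_ratio_from_reuse(sample, lines), 3))
```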

Journal ArticleDOI
TL;DR: The article is based on the experience of the European RITA and American NACAD working groups and is an extended revision of trimming guides published in 1995; the optimum localization for tissue preparation, the sample size, the direction of sectioning, and the number of sections to be prepared are described organ by organ.

Journal ArticleDOI
TL;DR: In this article, the authors applied the discrete path sampling approach to analyze the dynamics of several atomic and molecular clusters and calculated permutational isomerization rates for icosahedral atomic clusters containing 13 and 55 atoms.
Abstract: The discrete path sampling approach is applied to analyse the dynamics of several atomic and molecular clusters. Permutational isomerization rates are first calculated for icosahedral atomic clusters containing 13 and 55 atoms. The transformation between decahedral and icosahedral morphologies of a 75-atom cluster is then investigated, for which the potential energy surface has double funnel character. The final system considered is a cluster of twenty water molecules treated using a rigid molecule pair potential. Detailed analysis of the database of stationary points produced by the initial sampling is used to investigate the accuracy of the two-state description in each case. A clear deviation from two-state behaviour occurs for (H2O)20, where low-lying intervening minima exist.