
Showing papers on "Sampling (statistics)" published in 2001


Journal ArticleDOI
TL;DR: A series of common pitfalls in quantifying and comparing taxon richness is surveyed; taxon sampling (accumulation and rarefaction) curves contain the basic information for valid comparisons, including category-subcategory ratios (species-to-genus and species-to-individual ratios), and sample-based and individual-based rarefaction methods allow for meaningful standardization and comparison of data sets.
Abstract: Species richness is a fundamental measurement of community and regional diversity, and it underlies many ecological models and conservation strategies. In spite of its importance, ecologists have not always appreciated the effects of abundance and sampling effort on richness measures and comparisons. We survey a series of common pitfalls in quantifying and comparing taxon richness. These pitfalls can be largely avoided by using accumulation and rarefaction curves, which may be based on either individuals or samples. These taxon sampling curves contain the basic information for valid richness comparisons, including category-subcategory ratios (species-to-genus and species-to-individual ratios). Rarefaction methods, both sample-based and individual-based, allow for meaningful standardization and comparison of data sets. Standardizing data sets by area or sampling effort may produce very different results compared to standardizing by number of individuals collected, and it is not always clear which measure of diversity is more appropriate. Asymptotic richness estimators provide lower-bound estimates for taxon-rich groups such as tropical arthropods, in which observed richness rarely reaches an asymptote, despite intensive sampling. Recent examples of diversity studies of tropical trees, stream invertebrates, and herbaceous plants emphasize the importance of carefully quantifying species richness using taxon sampling curves.
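
A minimal sketch (not from the paper) of individual-based rarefaction: the expected number of species in a random subsample of n individuals, computed with the standard hypergeometric formula. The species-abundance vector `community` is an illustrative placeholder.

```python
import numpy as np
from scipy.special import gammaln

def rarefied_richness(abundances, n):
    """Expected species count in a random subsample of n individuals
    (individual-based rarefaction, hypergeometric formulation)."""
    abundances = np.asarray(abundances)
    N = abundances.sum()
    # P(species i absent from subsample) = C(N - N_i, n) / C(N, n), in log space
    log_absent = (gammaln(N - abundances + 1) - gammaln(n + 1) - gammaln(N - abundances - n + 1)
                  - (gammaln(N + 1) - gammaln(n + 1) - gammaln(N - n + 1)))
    p_absent = np.where(N - abundances >= n, np.exp(log_absent), 0.0)
    return np.sum(1.0 - p_absent)

# Rarefaction curve for a hypothetical community, standardized by individuals
community = [120, 60, 30, 15, 8, 4, 2, 1, 1, 1]
curve = [rarefied_richness(community, n) for n in range(1, sum(community) + 1, 10)]
```

Plotting such curves for two data sets against the same number of individuals is what makes the richness comparison valid.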

5,706 citations


Proceedings Article
28 Jun 2001
TL;DR: This paper presents an active learning method that directly optimizes expected future error, in contrast to many other popular techniques that instead aim to reduce version space size.
Abstract: This paper presents an active learning method that directly optimizes expected future error. This is in contrast to many other popular techniques that instead aim to reduce version space size. These other methods are popular because for many learning models, closed form calculation of the expected future error is intractable. Our approach is made feasible by taking a sampling approach to estimating the expected reduction in error due to the labeling of a query. In experimental results on two real-world data sets we reach high accuracy very quickly, sometimes with four times fewer labeled examples than competing methods.
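
A minimal sketch (not the authors' implementation) of the expected-error-reduction idea: for each candidate query, average the estimated future error over its possible labels, weighted by the current model's predicted label probabilities. `LogisticRegression` and the uncertainty-based error proxy are assumptions standing in for whatever learner and error estimate are used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def expected_error_after_query(X_lab, y_lab, X_pool, candidate_idx):
    """Estimated expected future error if we query the label of one pool point.
    Future 'error' is approximated by the mean uncertainty (1 - max probability)
    of the retrained model over the remaining unlabeled pool."""
    base = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    p_label = base.predict_proba(X_pool[candidate_idx:candidate_idx + 1])[0]
    rest = np.delete(X_pool, candidate_idx, axis=0)
    expected = 0.0
    for label, p in zip(base.classes_, p_label):
        # Pretend the oracle returned `label`, retrain, and score the pool
        X_aug = np.vstack([X_lab, X_pool[candidate_idx]])
        y_aug = np.append(y_lab, label)
        model = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
        err = np.mean(1.0 - model.predict_proba(rest).max(axis=1))
        expected += p * err
    return expected

def choose_query(X_lab, y_lab, X_pool):
    """Query the pool point with the lowest expected future error."""
    scores = [expected_error_after_query(X_lab, y_lab, X_pool, i)
              for i in range(len(X_pool))]
    return int(np.argmin(scores))
```

In practice the pool is subsampled, since the retraining loop is what makes the closed-form calculation intractable in the first place.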

929 citations


Journal ArticleDOI
TL;DR: It is hypothesized that lack of fit of the model in the population will not, on the average, influence recovery of population factors in analysis of sample data, regardless of degree of model error and regardless of sample size.
Abstract: This article examines effects of sample size and other design features on correspondence between factors obtained from analysis of sample data and those present in the population from which the samples were drawn. We extend earlier work on this question by examining these phenomena in the situation in which the common factor model does not hold exactly in the population. We present a theoretical framework for representing such lack of fit and examine its implications in the population and sample. Based on this approach we hypothesize that lack of fit of the model in the population will not, on the average, influence recovery of population factors in analysis of sample data, regardless of degree of model error and regardless of sample size. Rather, such recovery will be affected only by phenomena related to sampling error which have been studied previously. These hypotheses are investigated and verified in two sampling studies, one using artificial data and one using empirical data.

901 citations


Journal ArticleDOI
TL;DR: A general formula for the density of a vine dependent distribution is derived, which generalizes the well-known density formula for belief nets based on the decomposition of belief nets into cliques and allows a simple proof of the Information Decomposition Theorem for a regular vine.
Abstract: A vine is a new graphical model for dependent random variables. Vines generalize the Markov trees often used in modeling multivariate distributions. They differ from Markov trees and Bayesian belief nets in that the concept of conditional independence is weakened to allow for various forms of conditional dependence. A general formula for the density of a vine dependent distribution is derived. This generalizes the well-known density formula for belief nets based on the decomposition of belief nets into cliques. Furthermore, the formula allows a simple proof of the Information Decomposition Theorem for a regular vine. The problem of (conditional) sampling is discussed, and Gibbs sampling is proposed to carry out sampling from conditional vine dependent distributions. The so-called ‘canonical vines’ built on highest degree trees offer the most efficient structure for Gibbs sampling.
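
A minimal sketch (not the paper's vine construction) of the pair-copula conditional sampling step that vine models are built from, using a Gaussian pair-copula: draw u1 uniformly, then draw u2 from its conditional distribution given u1. The correlation parameter `rho` is illustrative.

```python
import numpy as np
from scipy.stats import norm

def sample_gaussian_pair_copula(n, rho, rng=None):
    """Draw (u1, u2) from a bivariate Gaussian copula by conditional sampling:
    u1 ~ U(0,1), then u2 | u1 via the conditional normal distribution."""
    rng = np.random.default_rng() if rng is None else rng
    u1 = rng.uniform(size=n)
    z1 = norm.ppf(u1)
    # Conditional draw: Z2 | Z1 = z1 is N(rho * z1, 1 - rho**2)
    z2 = rho * z1 + np.sqrt(1.0 - rho ** 2) * norm.ppf(rng.uniform(size=n))
    return u1, norm.cdf(z2)

u1, u2 = sample_gaussian_pair_copula(10_000, rho=0.7)
```

A vine chains such conditional draws along its trees; a canonical vine conditions every variable on a common root, which is what makes Gibbs-style sampling efficient.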

836 citations


Journal ArticleDOI
TL;DR: A new simulation technique for tracking moving target distributions is proposed that, unlike existing particle filters, does not suffer from progressive degeneration as the target sequence evolves.
Abstract: Markov chain Monte Carlo (MCMC) sampling is a numerically intensive simulation technique which has greatly improved the practicality of Bayesian inference and prediction. However, MCMC sampling is too slow to be of practical use in problems involving a large number of posterior (target) distributions, as in dynamic modelling and predictive model selection. Alternative simulation techniques for tracking moving target distributions, known as particle filters, which combine importance sampling, importance resampling and MCMC sampling, tend to suffer from a progressive degeneration as the target sequence evolves. We propose a new technique, based on these same simulation methodologies, which does not suffer from this progressive degeneration.
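
A minimal sketch (not the authors' new method) of the standard particle filter combining importance sampling and importance resampling, for a 1-D random-walk state observed with Gaussian noise; the state model and noise scales are illustrative assumptions.

```python
import numpy as np

def particle_filter(observations, n_particles=1000, proc_sd=1.0, obs_sd=1.0, rng=None):
    """Bootstrap particle filter: propagate, weight by likelihood, resample."""
    rng = np.random.default_rng() if rng is None else rng
    particles = rng.normal(0.0, 1.0, n_particles)
    estimates = []
    for y in observations:
        # Propagate through the (assumed) random-walk state model
        particles = particles + rng.normal(0.0, proc_sd, n_particles)
        # Importance weights from the Gaussian observation likelihood
        log_w = -0.5 * ((y - particles) / obs_sd) ** 2
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        estimates.append(np.sum(w * particles))
        # Importance resampling: this repeated duplication of a few heavy
        # particles is the source of the progressive degeneration
        particles = rng.choice(particles, size=n_particles, replace=True, p=w)
    return np.array(estimates)
```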

828 citations


Journal ArticleDOI
TL;DR: A unified framework for uniform and nonuniform sampling and reconstruction in shift-invariant subspaces is provided by bringing together wavelet theory, frame theory, reproducing kernel Hilbert spaces, approximation theory, amalgam spaces, and sampling.
Abstract: This article discusses modern techniques for nonuniform sampling and reconstruction of functions in shift-invariant spaces. It is a survey as well as a research paper and provides a unified framework for uniform and nonuniform sampling and reconstruction in shift-invariant subspaces by bringing together wavelet theory, frame theory, reproducing kernel Hilbert spaces, approximation theory, amalgam spaces, and sampling. Inspired by applications taken from communication, astronomy, and medicine, the following aspects will be emphasized: (a) The sampling problem is well defined within the setting of shift-invariant spaces. (b) The general theory works in arbitrary dimension and for a broad class of generators. (c) The reconstruction of a function from any sufficiently dense nonuniform sampling set is obtained by efficient iterative algorithms. These algorithms converge geometrically and are robust in the presence of noise. (d) To model the natural decay conditions of real signals and images, the sampling theory is developed in weighted L p-spaces.
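
A minimal sketch (not the paper's iterative frame algorithm) of reconstruction in a shift-invariant space: fit the coefficients of f(x) = sum_k c_k * phi(x - k) to nonuniform samples by least squares, here with a linear B-spline (hat function) as the generator. The sample locations and test signal are illustrative.

```python
import numpy as np

def hat(x):
    """Linear B-spline generator phi(x) = max(1 - |x|, 0)."""
    return np.maximum(1.0 - np.abs(x), 0.0)

def reconstruct(sample_x, sample_y, n_shifts):
    """Least-squares fit of f(x) = sum_k c_k * hat(x - k) to nonuniform samples."""
    shifts = np.arange(n_shifts)
    A = hat(sample_x[:, None] - shifts[None, :])      # A[j, k] = phi(x_j - k)
    coeffs, *_ = np.linalg.lstsq(A, sample_y, rcond=None)
    return lambda x: hat(np.asarray(x)[:, None] - shifts[None, :]) @ coeffs

# Nonuniform but sufficiently dense samples of a smooth signal on [0, 20)
rng = np.random.default_rng(0)
xs = np.sort(rng.uniform(0, 20, 80))
f = reconstruct(xs, np.sin(0.4 * xs), n_shifts=21)
approx = f(np.linspace(0, 20, 200))
```

The paper's point is that for sufficiently dense sampling sets this kind of problem is well posed and can be solved by fast, geometrically convergent iterations rather than a direct solve.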

762 citations


Journal ArticleDOI
Lin Liang1, Ce Liu1, Ying-Qing Xu1, Baining Guo1, Heung-Yeung Shum1 
TL;DR: An algorithm for synthesizing textures from an input sample by sampling patches according to a nonparametric estimation of the local conditional MRF density function, to avoid mismatching features across patch boundaries.
Abstract: We present an algorithm for synthesizing textures from an input sample. This patch-based sampling algorithm is fast and it makes high-quality texture synthesis a real-time process. For generating textures of the same size and comparable quality, patch-based sampling is orders of magnitude faster than existing algorithms. The patch-based sampling algorithm works well for a wide variety of textures ranging from regular to stochastic. By sampling patches according to a nonparametric estimation of the local conditional MRF density function, we avoid mismatching features across patch boundaries. We also experimented with documented cases for which pixel-based nonparametric sampling algorithms cease to be effective but our algorithm continues to work well.
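
A minimal sketch (not the full synthesis pipeline) of the sampling step: rather than always taking the single best-matching patch, pick uniformly at random from all candidate patches whose boundary-zone SSD is within a tolerance of the best match, which approximates sampling from the local conditional density. The array shapes and tolerance are illustrative assumptions.

```python
import numpy as np

def sample_matching_patch(candidates, boundary_mask, target, tol=0.1, rng=None):
    """Randomly pick one patch among those whose boundary zone matches `target`
    within (1 + tol) of the minimum sum-of-squared-differences.

    candidates:    (n, h, w, 3) patches cut from the input sample
    boundary_mask: (h, w, 1) array, 1 on the overlap/boundary zone, 0 elsewhere
    target:        (h, w, 3) already-synthesized pixels in the boundary zone
    """
    rng = np.random.default_rng() if rng is None else rng
    diff = (candidates.astype(float) - target.astype(float)) * boundary_mask
    ssd = (diff ** 2).sum(axis=(1, 2, 3))
    pool = np.flatnonzero(ssd <= ssd.min() * (1.0 + tol))
    return candidates[rng.choice(pool)]
```

Sampling among good candidates, instead of greedily copying the best one, is what keeps the synthesized texture from looking verbatim-repeated while still avoiding mismatched patch boundaries.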

731 citations



Journal ArticleDOI
TL;DR: A conceptual model of how interdependent environmental factors shape regional-scale variation in local diversity in the deep sea is presented, showing how environmental gradients may form geographic patterns of diversity by influencing local processes such as predation, resource partitioning, competitive exclusion, and facilitation that determine species coexistence.
Abstract: Most of our knowledge of biodiversity and its causes in the deep-sea benthos derives from regional-scale sampling studies of the macrofauna. Improved sampling methods and the expansion of investigations into a wide variety of habitats have revolutionized our understanding of the deep sea. Local species diversity shows clear geographic variation on spatial scales of 100-1000 km. Recent sampling programs have revealed unexpected complexity in community structure at the landscape level that is associated with large-scale oceanographic processes and their environmental consequences. We review the relationships between variation in local species diversity and the regional-scale phenomena of boundary constraints, gradients of productivity, sediment heterogeneity, oxygen availability, hydrodynamic regimes, and catastrophic physical disturbance. We present a conceptual model of how these interdependent environmental factors shape regional-scale variation in local diversity. Local communities in the deep sea may be composed of species that exist as metapopulations whose regional distribution depends on a balance among global-scale, landscape-scale, and small-scale dynamics. Environmental gradients may form geographic patterns of diversity by influencing local processes such as predation, resource partitioning, competitive exclusion, and facilitation that determine species coexistence. The measurement of deep-sea species diversity remains a vital issue in comparing geographic patterns and evaluating their potential causes. Recent assessments of diversity using species accumulation curves with randomly pooled samples confirm the often-disputed claim that the deep sea supports higher diversity than the continental shelf. However, more intensive quantitative sampling is required to fully characterize the diversity of deep-sea sediments, the most extensive habitat on Earth. Once considered to be constant, spatially uniform, and isolated, deep-sea sediments are now recognized as a dynamic, richly textured environment that is inextricably linked to the global biosphere. Regional studies of the last two decades provide the empirical background necessary to formulate and test specific hypotheses of causality by controlled sampling designs and experimental approaches.

680 citations


Book
01 Jan 2001
TL;DR: An edited volume on nonuniform sampling, covering theory (sampling analysis, Lagrange interpolation, iterative recovery of missing samples) and applications (magnetic resonance imaging, exploration seismology, speech coding, array processing), with chapters by F. Marvasti, A.I. Zayed, P.L. Butzer, and others.
Abstract: 1. Introduction F. Marvasti. 2. An Introduction to Sampling Analysis P.L. Butzer, et al. 3. Lagrange Interpolation and Sampling Theorems A.I. Zayed, P.L. Butzer. 4. Random Topics in Nonuniform Sampling F. Marvasti. 5. Iterative and Noniterative Recovery of Missing Samples for 1-D Band-Limited Signals P.J.S.G. Ferreira. 6. Numerical and Theoretical Aspects of Nonuniform Sampling of Band-Limited Images K. Grochenig, T. Strohmer. 7. The Nonuniform Discrete Fourier Transform S. Bagchi, S.K. Mitra. 8. Reconstruction of Stationary Processes Sampled at Random Times B. Lacaze. 9. Zero Crossings of Random Processes with Application to Estimation and Detection J. Barnett. 10. Magnetic Resonance Image Reconstruction from Nonuniformly Sampled k-Space Data F.T.A.W. Wajer, et al. 11. Irregular and Sparse Sampling in Exploration Seismology A.J.W. Duijndam, et al. 12. Randomized Digital Optimal Control W.L. de Koning, L.G. van Willigenburg. 13. Prediction of Band-Limited Signals from Past Samples and Applications to Speech Coding D.H. Muler, Y. Wu. 14. Frames, Irregular Sampling, and a Wavelet Auditory Model J.J. Benedetto, S. Scott. 15. Application of the Nonuniform Sampling to Motion Compensated Prediction A. Sharif, et al. 16. Applications of Nonuniform Sampling to Nonlinear Modulation, A/D and D/A Techniques F. Marvasti, M. Sandler. 17. Applications to Error Correction Codes F. Marvasti. 18. Application of Nonuniform Sampling to Error Concealment M. Hasan, F. Marvasti. 19. Sparse Sampling in Array Processing S. Holm, et al. 20. Fractional Delay Filters: Design and Applications V. Valimaki, T.I. Laakso.

653 citations


Journal ArticleDOI
TL;DR: The authors describe a venue-based application of time-space sampling (TSS) that addresses the challenges of accessing hard-to-reach populations and uses it in the ongoing Community Intervention Trial for Youth (CITY) project to generate a systematic sample of young men who have sex with men.
Abstract: Constructing scientifically sound samples of hard-to-reach populations, also known as hidden populations, is a challenge for many research projects. Traditional sample survey methods, such as random sampling from telephone or mailing lists, can yield low numbers of eligible respondents while non-probability sampling introduces unknown biases. The authors describe a venue-based application of time-space sampling (TSS) that addresses the challenges of accessing hard-to-reach populations. The method entails identifying days and times when the target population gathers at specific venues, constructing a sampling frame of venue, day-time units (VDTs), randomly selecting and visiting VDTs (the primary sampling units), and systematically intercepting and collecting information from consenting members of the target population. This allows researchers to construct a sample with known properties, make statistical inference to the larger population of venue visitors, and theorize about the introduction of biases that may limit generalization of results to the target population. The authors describe their use of TSS in the ongoing Community Intervention Trial for Youth (CITY) project to generate a systematic sample of young men who have sex with men. The project is an ongoing community level HIV prevention intervention trial funded by the Centers for Disease Control and Prevention. The TSS method is reproducible and can be adapted to hard-to-reach populations in other situations, environments, and cultures.
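
A minimal sketch (not the CITY protocol itself) of the frame-and-select step: enumerate venue, day-time units (VDTs), draw a random sample of VDTs as primary sampling units, then systematically intercept every k-th consenting person at each selected VDT. Venue names, day-time windows, and parameters are illustrative.

```python
import itertools
import random

venues = ["club_a", "park_b", "cafe_c"]                    # hypothetical venues
day_times = ["fri_22-24", "sat_20-22", "sun_16-18"]        # hypothetical day-time windows

# Sampling frame of venue, day-time units (VDTs)
frame = list(itertools.product(venues, day_times))

# Randomly select VDTs as primary sampling units
rng = random.Random(2001)
selected_vdts = rng.sample(frame, k=4)

def intercept(attendees, every_kth=3):
    """Systematic interception: approach every k-th eligible person at a VDT."""
    return attendees[::every_kth]

# e.g. eligible people observed at one selected VDT during its time window
sampled_people = intercept([f"person_{i}" for i in range(30)], every_kth=3)
```

Because the VDTs are selected with known probabilities, the resulting sample supports statistical inference to the population of venue visitors.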

Journal ArticleDOI
TL;DR: Digital image analysis proved to be an effective means of determining turfgrass cover, producing both accurate and reproducible data, and effectively removes the inherent error and evaluator bias commonly associated with subjective ratings.
Abstract: Accurate cover estimates in turfgrass research plots are often difficult to obtain because of the time involved with traditional sampling and evaluation techniques. Subjective ratings are commonly used to estimate turfgrass cover, but the data can be quite variable and difficult to reproduce. New technologies and software related to digital image analysis (DIA) may provide an alternative method to measure turfgrass parameters more accurately and efficiently than current techniques. A series of studies was conducted to determine the applicability of DIA for turfgrass cover estimates. In the first study, plots containing a range (1-16) of bermudagrass [Cynodon dactylon (L.) Pers.] plugs of specific diameter (15.0 cm) were established to represent values of turfgrass cover from 0.75 to 12%, by 0.75% increments. Digital images (1280 by 960 pixels) were taken with a digital camera and processed for percent green color using a software package. Estimates of green turfgrass cover by DIA were highly correlated (r² > 0.99) with the calculated values of turfgrass cover. In a second study, DIA of turfgrass cover was compared with subjective analysis (SA) and line-intersect analysis (LIA) methods for estimating cover in eight plots of zoysiagrass (Zoysia japonica Steudel). The mean variance of percent cover determined by DIA (0.65) was significantly lower than SA (99.12) or LIA (13.18). Digital image analysis proved to be an effective means of determining turfgrass cover, producing both accurate and reproducible data. In addition, the technique effectively removes the inherent error and evaluator bias commonly associated with subjective ratings.
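
A minimal sketch (not the software used in the study) of estimating percent green cover from a digital image: classify pixels as green with a simple hue/saturation threshold and report the green fraction. The thresholds are illustrative assumptions that would need calibration for real turf images.

```python
import numpy as np
from PIL import Image

def percent_green_cover(path, hue_range=(60, 180), min_sat=0.10):
    """Percent of image pixels classified as green turf.

    hue_range is in degrees on the 0-360 HSV hue circle; min_sat screens out
    gray/soil pixels. Both thresholds are illustrative, not calibrated values.
    """
    hsv = np.asarray(Image.open(path).convert("HSV"), dtype=float)
    hue = hsv[..., 0] * 360.0 / 255.0
    sat = hsv[..., 1] / 255.0
    green = (hue >= hue_range[0]) & (hue <= hue_range[1]) & (sat >= min_sat)
    return 100.0 * green.mean()
```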

01 Jan 2001
TL;DR: Five experimental design types and four approximation model types are compared in terms of their capability to generate accurate approximations for two engineering applications with typical engineering behaviors and a wide range of nonlinearity; detailed error analysis reveals that uniform designs provide good sampling for generating accurate approximations using different sample sizes, while kriging models provide accurate approximations that are robust across a variety of experimental designs and sample sizes.
Abstract: Computer-based simulation and analysis is used extensively in engineering for a variety of tasks. Despite the steady and continuing growth of computing power and speed, the computational cost of complex high-fidelity engineering analyses and simulations limit their use in important areas like design optimization and reliability analysis. Statistical approximation techniques such as design of experiments and response surface methodology are becoming widely used in engineering to minimize the computational expense of running such computer analyses and circumvent many of these limitations. In this paper, we compare and contrast five experimental design types and four approximation model types in terms of their capability to generate accurate approximations for two engineering applications with typical engineering behaviors and a wide range of nonlinearity. The first example involves the analysis of a two-member frame that has three input variables and three responses of interest. The second example simulates the roll-over potential of a semi-tractor-trailer for different combinations of input variables and braking and steering levels. Detailed error analysis reveals that uniform designs provide good sampling for generating accurate approximations using different sample sizes while kriging models provide accurate approximations that are robust for use with a variety of experimental designs and sample sizes.
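
A minimal sketch (not the paper's study) of the workflow being compared: generate a space-filling experimental design, run the expensive analysis at those points, and fit an approximation model. A hand-rolled Latin hypercube design and a quadratic response surface stand in for the designs and metamodels examined; the test function is an illustrative placeholder for the simulation code.

```python
import numpy as np

def latin_hypercube(n_samples, n_dims, rng=None):
    """Simple Latin hypercube design on the unit cube: one point per stratum
    in every dimension, with strata paired by independent permutations."""
    rng = np.random.default_rng() if rng is None else rng
    cols = [(rng.permutation(n_samples) + rng.uniform(size=n_samples)) / n_samples
            for _ in range(n_dims)]
    return np.column_stack(cols)

def quadratic_features(X):
    """Full quadratic response-surface basis: 1, x_i, and x_i * x_j terms."""
    n, d = X.shape
    cross = [X[:, i] * X[:, j] for i in range(d) for j in range(i, d)]
    return np.column_stack([np.ones(n), X] + cross)

def expensive_analysis(X):
    """Stand-in for an expensive simulation code (illustrative test function)."""
    return np.sin(3.0 * X[:, 0]) + X[:, 1] ** 2

X_train = latin_hypercube(30, 2, rng=np.random.default_rng(1))
y_train = expensive_analysis(X_train)
beta, *_ = np.linalg.lstsq(quadratic_features(X_train), y_train, rcond=None)
predict = lambda X_new: quadratic_features(X_new) @ beta   # the fitted metamodel
```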

Journal ArticleDOI
TL;DR: This article presents an extended example using complex sample survey data to demonstrate how researchers can address problems associated with oversampling and clustering of observations in these designs.
Abstract: Most large-scale secondary data sets used in higher education research (e.g., NPSAS or BPS) are constructed using complex survey sample designs where the population of interest is stratified on a number of dimensions and oversampled within certain of these strata. Moreover, these complex sample designs often cluster lower level units (e.g., students) within higher level units (e.g., colleges) to achieve efficiencies in the sampling process. Ignoring oversampling (unequal probability of selection) in complex survey designs presents problems when trying to make inferences—data from these designs are, in their raw form, admittedly nonrepresentative of the population to which they are designed to generalize. Ignoring the clustering of observations in these sampling designs presents a second set of problems when making inferences about variability in the population and testing hypotheses and usually leads to an increased likelihood of committing Type I errors (declaring something as an effect when in fact it is not). This article presents an extended example using complex sample survey data to demonstrate how researchers can address problems associated with oversampling and clustering of observations in these designs.
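
A minimal sketch (not tied to NPSAS or BPS) of the two adjustments the article is concerned with: use probability (design) weights so oversampled strata do not dominate point estimates, and inflate the naive variance by an approximate design effect to reflect clustering. The weights, cluster size, and intraclass correlation below are illustrative.

```python
import numpy as np

def weighted_mean(y, w):
    """Design-weighted point estimate (w = inverse selection probabilities)."""
    y, w = np.asarray(y, float), np.asarray(w, float)
    return np.sum(w * y) / np.sum(w)

def design_effect(avg_cluster_size, icc):
    """Approximate design effect for clustering: deff = 1 + (m - 1) * rho."""
    return 1.0 + (avg_cluster_size - 1.0) * icc

# Naive standard errors understate uncertainty when observations are clustered,
# which is what drives the inflated Type I error rate discussed above
y = np.random.default_rng(0).normal(size=400)    # placeholder outcome data
w = np.ones(400)                                 # replace with survey weights
estimate = weighted_mean(y, w)
naive_se = y.std(ddof=1) / np.sqrt(len(y))
adjusted_se = naive_se * np.sqrt(design_effect(avg_cluster_size=20, icc=0.05))
```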

Journal ArticleDOI
TL;DR: The fifth installment in a series on survey research discusses what is meant by a population and a sample, and the implications of each for survey research in software engineering.
Abstract: This article is the fifth installment of our series of articles on survey research. In it, we discuss what we mean by a population and a sample and the implications of each for survey research. We ...

Journal ArticleDOI
TL;DR: In this article, the authors present a phase-encoded optical sampling technique for analog-to-digital (ADC) converters with high-extinction LiNbO/sub 3/1/to-8 optical time-division demultiplexers.
Abstract: Optically sampled analog-to-digital converters (ADCs) combine optical sampling with electronic quantization to enhance the performance of electronic ADCs. In this paper, we review the prior and current work in this field, and then describe our efforts to develop and extend the bandwidth of a linearized sampling technique referred to as phase-encoded optical sampling. The technique uses a dual-output electrooptic sampling transducer to achieve both high linearity and 60-dB suppression of laser amplitude noise. The bandwidth of the technique is extended by optically distributing the post-sampling pulses to an array of time-interleaved electronic quantizers. We report on the performance of a 505-MS/s (megasample per second) optically sampled ADC that includes high-extinction LiNbO3 1-to-8 optical time-division demultiplexers. Initial characterization of the 505-MS/s system reveals a maximum signal-to-noise ratio of 51 dB (8.2 bits) and a spur-free dynamic range of 61 dB. The performance of the present system is limited by electronic quantizer noise, photodiode saturation, and preliminary calibration procedures. None of these fundamentally limit this sampling approach, which should enable multigigahertz converters with 12-b resolution. A signal-to-noise analysis of the phase-encoded sampling technique shows good agreement with measured data from the 505-MS/s system.
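
A small worked check (not from the paper) of how a signal-to-noise ratio maps to effective bits using the standard ENOB relation; the 51-dB figure comes from the abstract.

```python
def enob(snr_db):
    """Effective number of bits from SNR, standard relation for an ideal quantizer."""
    return (snr_db - 1.76) / 6.02

print(enob(51.0))   # ~8.2 bits, matching the reported resolution
```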

Book
06 Apr 2001
TL;DR: A practical guide to applying social science research in real-world organizations, covering access to an organization, project design, methods of data collection, sampling considerations, assessing performance, data analysis, and reporting research findings.
Abstract: Introduction Applying Social Science to the Real World Starting Off the Research Process Obtaining and Using Access to an Organization Project Design Methods of Data Collection Sampling Considerations Assessing Performance in Organizations Data Analysis Reporting Research Findings

Journal ArticleDOI
TL;DR: In this article, sampling-related errors of tipping-bucket (TB) rain gauge measurements were investigated, focusing on the gauge's ability to represent the small-scale rainfall temporal variability, and the results showed the importance of using fine resolution of both the sampling time interval and the bucket size so that the TB rain rate estimates have minimum levels of uncertainty.
Abstract: In this study we investigated sampling-related errors of tipping-bucket (TB) rain gauge measurements, focusing on the gauge’s ability to represent the small-scale rainfall temporal variability. We employed a simple TB simulator that used ultra-high-resolution measurements from an experimental optical rain gauge. The simulated observations were used to provide TB rain rate estimates on time scales as low as one minute. The simulation results showed that the TB estimates suffer from significant errors if based on time scales less than 10 to 15 minutes. We provide the approximate formulas used to characterize the TB sampling errors at several time scales. Our results show the importance of using fine resolution of both the sampling time interval and the bucket size so that the TB rain rate estimates have minimum levels of uncertainty.
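
A minimal sketch (not the study's simulator) of a tipping-bucket gauge driven by a high-resolution rain-rate series: accumulate depth, record a tip each time the bucket fills, then estimate rain rate over a chosen averaging window from the tip counts. The resolutions and the synthetic rain series are illustrative.

```python
import numpy as np

def simulate_tips(rain_rate_mm_per_h, dt_s=10.0, bucket_mm=0.254):
    """Return tip times (s) for a high-resolution rain-rate series."""
    depth, tips, t = 0.0, [], 0.0
    for r in rain_rate_mm_per_h:
        depth += r * dt_s / 3600.0            # depth accumulated this step (mm)
        while depth >= bucket_mm:             # one tip per full bucket
            depth -= bucket_mm
            tips.append(t)
        t += dt_s
    return np.array(tips)

def tb_rain_rate(tip_times_s, window_s, duration_s, bucket_mm=0.254):
    """Tipping-bucket rain-rate estimate (mm/h) on a given averaging window."""
    edges = np.arange(0.0, duration_s + window_s, window_s)
    counts, _ = np.histogram(tip_times_s, bins=edges)
    return counts * bucket_mm / (window_s / 3600.0)

# Synthetic bursty rain: 1-min estimates are far noisier than 15-min estimates
rng = np.random.default_rng(3)
rain = rng.gamma(shape=0.5, scale=8.0, size=6 * 360)      # 6 h at 10-s resolution
tips = simulate_tips(rain)
rate_1min = tb_rain_rate(tips, 60.0, 6 * 3600.0)
rate_15min = tb_rain_rate(tips, 900.0, 6 * 3600.0)
```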

Journal ArticleDOI
TL;DR: In this paper, a new approach was proposed to define the random error component of representative eddy correlation flux measurements of momentum, sensible and latent heat, carbon dioxide, and ozone from five field studies, three over agricultural crops (corn, soybean, and pasture), and two from towers over forests (deciduous and mixed).
Abstract: Sampling errors in eddy correlation flux measurements arise from the small number of large eddies that dominate the flux during typical sampling periods. Several methods to estimate sampling, or random error in flux measurements, have been published. These methods are compared to a more statistically rigorous method which calculates the variance of a covariance when the two variables in the covariance are auto- and cross-correlated. Comparisons are offered between the various methods. Compared to previously published methods, error estimates from this technique were 20 to 25% higher because of the incorporation of additional terms in the estimate of the variance. This new approach is then applied to define the random error component of representative eddy correlation flux measurements of momentum, sensible and latent heat, carbon dioxide, and ozone from five field studies, three over agricultural crops (corn, soybean, and pasture), and two from towers over forests (deciduous and mixed). The mean normalized error for each type of flux measurement over the five studies ranged from 12% for sensible heat flux to 31% for ozone flux. There were not large or significant differences between random errors for fluxes measured over crops versus those measured over forests. The effects of stability, flux magnitude, and wind speed on measurement error are discussed.
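
A minimal sketch (a hedged reading of the approach, not the authors' code) of estimating the variance of a sample covariance when both series are auto- and cross-correlated, by summing products of sample auto- and cross-covariances over lags; the maximum lag is an illustrative choice.

```python
import numpy as np

def lagged_cov(a, b, lag):
    """Sample cross-covariance of a and b at a given lag (biased, 1/n)."""
    a = a - a.mean()
    b = b - b.mean()
    n = len(a)
    if lag >= 0:
        return np.sum(a[lag:] * b[:n - lag]) / n
    return np.sum(a[:n + lag] * b[-lag:]) / n

def covariance_sampling_variance(x, w, max_lag=200):
    """Variance of the sample covariance of two jointly stationary series:
    var(cov) ~ (1/n) * sum_p [g_xx(p) g_ww(p) + g_xw(p) g_wx(p)]."""
    n = len(x)
    total = 0.0
    for p in range(-max_lag, max_lag + 1):
        total += (lagged_cov(x, x, p) * lagged_cov(w, w, p)
                  + lagged_cov(x, w, p) * lagged_cov(w, x, p))
    return total / n

# e.g. normalized flux sampling error for hypothetical fluctuation series w_prime, c_prime:
# rel_err = np.sqrt(covariance_sampling_variance(w_prime, c_prime)) / abs(np.mean(w_prime * c_prime))
```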

Journal ArticleDOI
TL;DR: Computer simulation studies by using natural collections of evolutionary parameters—rates of evolution, species sampling, and gene lengths—determined from data available in genomic databases suggest that longer sequences, rather than extensive sampling, will better improve the accuracy of phylogenetic inference.
Abstract: A major issue in all data collection for molecular phylogenetics is taxon sampling, which refers to the use of data from only a small representative set of species for inferring higher-level evolutionary history. Insufficient taxon sampling is often cited as a significant source of error in phylogenetic studies, and consequently, acquisition of large data sets is advocated. To test this assertion, we have conducted computer simulation studies by using natural collections of evolutionary parameters—rates of evolution, species sampling, and gene lengths—determined from data available in genomic databases. A comparison of the true tree with trees constructed by using taxa subsamples and trees constructed by using all taxa shows that the amount of phylogenetic error per internal branch is similar; a result that holds true for the neighbor-joining, minimum evolution, maximum parsimony, and maximum likelihood methods. Furthermore, our results show that even though trees inferred by using progressively larger taxa subsamples of a real data set become increasingly similar to trees inferred by using the full sample, all inferred trees are equidistant from the true tree in terms of phylogenetic error per internal branch. Our results suggest that longer sequences, rather than extensive sampling, will better improve the accuracy of phylogenetic inference.

Journal ArticleDOI
TL;DR: The authors compared simulations of anthropogenic CO2 in the four three-dimensional ocean models that participated in the first phase of the Ocean Carbon Cycle Model Intercomparison Project (OCMIP), as a means to identify their major differences.
Abstract: We have compared simulations of anthropogenic CO2 in the four three-dimensional ocean models that participated in the first phase of the Ocean Carbon-Cycle Model Intercomparison Project (OCMIP), as a means to identify their major differences. Simulated global uptake agrees to within ±19%, giving a range of 1.85±0.35 Pg C yr−1 for the 1980–1989 average. Regionally, the Southern Ocean dominates the present-day air-sea flux of anthropogenic CO2 in all models, with one third to one half of the global uptake occurring south of 30°S. The highest simulated total uptake in the Southern Ocean was 70% larger than the lowest. Comparison with recent data-based estimates of anthropogenic CO2 suggest that most of the models substantially overestimate storage in the Southern Ocean; elsewhere they generally underestimate storage by less than 20%. Globally, the OCMIP models appear to bracket the real ocean's present uptake, based on comparison of regional data-based estimates of anthropogenic CO2 and bomb 14C. Column inventories of bomb 14C have become more similar to those for anthropogenic CO2 with the time that has elapsed between the Geochemical Ocean Sections Study (1970s) and World Ocean Circulation Experiment (1990s) global sampling campaigns. Our ability to evaluate simulated anthropogenic CO2 would improve if systematic errors associated with the data-based estimates could be provided regionally.

Journal ArticleDOI
TL;DR: Existing studies suggest that bee faunas are locally diverse, highly variable in space and time, and often rich in rare species, indicating that intense sampling among sites and years will be required to differentiate changes due to specific impacts from the natural dynamics of populations and communities.
Abstract: Changes in flower-visiting insect populations or communities that result from human impacts can be documented by measuring spatial or temporal trends, or by comparing abundance or species composition before and after disturbance. The level of naturally occurring variation in populations and communities over space and time will dictate the sampling effort required to detect human-induced changes. We compiled a set of existing surveys of the bee faunas of natural communities from around the world to examine patterns of abundance and richness. We focused on a subset of these studies to illustrate variation in bee communities among different sites and within sites over different spatial and temporal scales. We used examples from our compilation and other published studies to illustrate sampling approaches that maximize the value of future sampling efforts. Existing studies suggest that bee faunas are locally diverse, highly variable in space and time, and often rich in rare species. All of these attributes indicate that intense sampling among sites and years will be required to differentiate changes due to specific impacts from the natural dynamics of populations and communities. Given the limits on

Journal ArticleDOI
TL;DR: In this paper, the authors developed methods for adjusting grid box average temperature time series for the effects on variance of changing numbers of contributing data, and used different techniques over land and ocean.
Abstract: We develop methods for adjusting grid box average temperature time series for the effects on variance of changing numbers of contributing data. Owing to the different sampling characteristics of the data, we use different techniques over land and ocean. The result is to damp average temperature anomalies over a grid box by an amount inversely related to the number of contributing stations or observations. Variance corrections influence all grid box time series but have their greatest effects over data sparse oceanic regions. After adjustment, the grid box land and ocean surface temperature data sets are unaffected by artificial variance changes which might affect, in particular, the results of analyses of the incidence of extreme values. We combine the adjusted land surface air temperature and sea surface temperature data sets and apply a limited spatial interpolation. The effects of our procedures on hemispheric and global temperature anomaly series are small.

Book
08 Feb 2001
TL;DR: A handbook on designing and implementing monitoring programs, covering selection among priorities, qualitative and field techniques, data collection and management, basic principles of sampling, sampling design, statistical analysis, and analysis of trends.
Abstract: Preface. 1. Introduction To Monitoring. 2. Monitoring Overview. 3. Selecting Among Priorities. 4. Qualitative Techniques For Monitoring. 5. General Field Techniques. 6. Data Collection And Data Management. 7. Basic Principles Of Sampling. 8. Sampling Design. 9. Statistical Analysis. 10. Analysis Of Trends. 11. Selecting Random Samples. 12. Field Techniques For Measuring Vegetation. 13. Specialized Sampling Methods And Field Techniques For Animals. 14. Objectives. 15. Communication And Monitoring Plans. Appendix I: Monitoring Communities. Appendix II: Sample Size Equations. Appendix III: Confidence Interval Equations. Appendix IV: Sample Size And Confidence Intervals For Complex Sampling Designs. Literature Cited. Index.

Journal ArticleDOI
TL;DR: The authors introduce a method, based on resampling the available data, for validating the results of clustering analysis, together with a figure of merit that measures the stability of clustering solutions against resampling.
Abstract: We introduce a method for validation of results obtained by clustering analysis of data. The method is based on resampling the available data. A figure of merit that measures the stability of clustering solutions against resampling is introduced. Clusters that are stable against resampling give rise to local maxima of this figure of merit. This is presented first for a one-dimensional data set, for which an analytic approximation for the figure of merit is derived and compared with numerical measurements. Next, the applicability of the method is demonstrated for higher-dimensional data, including gene microarray expression data.
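
A minimal sketch (not the authors' exact figure of merit) of the general idea: cluster two random subsamples of the data and score how consistently the points they share are grouped; stable solutions give high agreement across many resamples. k-means and the adjusted Rand index are assumptions standing in for whatever clustering method and agreement score one prefers.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def stability(X, n_clusters, n_resamples=20, frac=0.8, seed=0):
    """Average agreement of cluster labels on points shared by two subsamples."""
    rng = np.random.default_rng(seed)
    n = len(X)
    scores = []
    for _ in range(n_resamples):
        idx1 = rng.choice(n, size=int(frac * n), replace=False)
        idx2 = rng.choice(n, size=int(frac * n), replace=False)
        km1 = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X[idx1])
        km2 = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X[idx2])
        shared = np.intersect1d(idx1, idx2)
        scores.append(adjusted_rand_score(km1.predict(X[shared]),
                                          km2.predict(X[shared])))
    return float(np.mean(scores))

# Pick the number of clusters whose solutions are most stable under resampling
# best_k = max(range(2, 10), key=lambda k: stability(X, k))
```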

Proceedings Article
Phillip B. Gibbons1
11 Sep 2001
TL;DR: This work presents an approach, called distinct sampling, that collects a specially tailored sample over the distinct values in the input, in a single scan of the data, and shows how it can provide fast, highly accurate approximate answers for “report” queries in high-volume, session-based event recording environments, such as IP networks, customer service call centers, etc.
Abstract: Estimating the number of distinct values is a well-studied problem, due to its frequent occurrence in queries and its importance in selecting good query plans. Previous work has shown powerful negative results on the quality of distinct-values estimates based on sampling (or other techniques that examine only part of the input data). We present an approach, called distinct sampling, that collects a specially tailored sample over the distinct values in the input, in a single scan of the data. In contrast to the previous negative results, our small Distinct Samples are guaranteed to accurately estimate the number of distinct values. The samples can be incrementally maintained up-to-date in the presence of data insertions and deletions, with minimal time and memory overheads, so that the full scan may be performed only once. Moreover, a stored Distinct Sample can be used to accurately estimate the number of distinct values within any range specified by the query, or within any other subset of the data satisfying a query predicate. We present an extensive experimental study of distinct sampling. Using synthetic and real-world data sets, we show that distinct sampling gives distinct-values estimates to within 0-10% relative error, whereas previous methods typically incur 50-250% relative error. Next, we show how distinct sampling can provide fast, highly accurate approximate answers for “report” queries in high-volume, session-based event recording environments, such as IP networks, customer service call centers, etc. For a commercial call center environment, we show that a 1% Distinct Sample
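
A minimal sketch in the spirit of distinct sampling (not Gibbons' exact algorithm): keep the distinct values whose hash falls below a shrinking threshold, in one pass, and scale the retained sample size by the inverse sampling rate to estimate the number of distinct values. The capacity parameter is illustrative.

```python
import hashlib

def _hash01(value):
    """Map a value to a pseudo-uniform number in [0, 1) via a stable hash."""
    digest = hashlib.sha1(str(value).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64

def estimate_distinct(stream, capacity=1024):
    """One-pass distinct-values estimate using a hash-threshold sample."""
    level = 0                      # current sampling rate is 2 ** -level
    sample = set()
    for v in stream:
        if _hash01(v) < 2.0 ** -level:
            sample.add(v)
            while len(sample) > capacity:     # halve the sampling rate and prune
                level += 1
                sample = {x for x in sample if _hash01(x) < 2.0 ** -level}
    return len(sample) * 2 ** level

# e.g. estimate_distinct(line.split()[0] for line in open("events.log"))  # hypothetical log file
```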

Proceedings ArticleDOI
01 Jul 2001
TL;DR: This work experimentally evaluates the effectiveness of using cluster analysis of execution profiles to find failures among the executions induced by a set of potential test cases and suggests that filtering procedures based on clustering are more effective than simple random sampling for identifying failures in populations of operational executions.
Abstract: We experimentally evaluate the effectiveness of using cluster analysis of execution profiles to find failures among the executions induced by a set of potential test cases. We compare several filtering procedures for selecting executions to evaluate for conformance to requirements. Each filtering procedure involves a choice of a sampling strategy and a clustering metric. The results suggest that filtering procedures based on clustering are more effective than simple random sampling for identifying failures in populations of operational executions, with adaptive sampling from clusters being the most effective sampling strategy. The results also suggest that clustering metrics that give extra weight to industrial profile features are most effective. Scatter plots of execution populations, produced by multidimensional scaling, are used to provide intuition for these results.

Proceedings ArticleDOI
03 Jul 2001
TL;DR: The distributed streams model is related to previously studied non-distributed (i.e., merged) streams models, presenting tight bounds on the gap between the distributed and merged models for deterministic algorithms, and employs a novel coordinated sampling technique to extract a sample of the union.
Abstract: Massive data sets often arise as physically distributed, parallel data streams. We present algorithms for estimating simple functions on the union of such data streams, while using only logarithmic space per stream. Each processor observes only its own stream, and communicates with the other processors only after observing its entire stream. This models the set-up in current network monitoring products. Our algorithms employ a novel coordinated sampling technique to extract a sample of the union; this sample can be used to estimate aggregate functions on the union. The technique can also be used to estimate aggregate functions over the distinct “labels” in one or more data streams, e.g., to determine the zeroth frequency moment (i.e., the number of distinct labels) in one or more data streams. Our space and time bounds are the best known for these problems, and our logarithmic space bounds for coordinated sampling contrast with polynomial lower bounds for independent sampling. We relate our distributed streams model to previously studied non-distributed (i.e., merged) streams models, presenting tight bounds on the gap between the distributed and merged models for deterministic algorithms.
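
A minimal sketch (not the paper's algorithms) of the coordinated-sampling idea: every processor applies the same hash function to its stream and keeps only labels whose hash falls below a common threshold, so the union of the per-stream samples is itself a hash sample of the distinct labels in the union. The threshold and label format are illustrative.

```python
import hashlib

def hash01(label):
    """Shared hash to [0, 1): identical across processors, so samples coordinate."""
    digest = hashlib.sha1(str(label).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64

def sample_stream(stream, threshold=0.01):
    """Run locally on each processor; keeps labels hashing below the threshold."""
    return {x for x in stream if hash01(x) < threshold}

def estimate_distinct_union(per_stream_samples, threshold=0.01):
    """Combine after each processor has observed its entire stream."""
    union_sample = set().union(*per_stream_samples)
    return len(union_sample) / threshold

# Two 'processors' observing overlapping streams of flow labels
s1 = sample_stream(f"flow_{i}" for i in range(0, 60_000))
s2 = sample_stream(f"flow_{i}" for i in range(40_000, 100_000))
approx_distinct = estimate_distinct_union([s1, s2])   # ~100,000 distinct labels
```

Because both processors keep exactly the same labels from the overlap, the combined sample behaves like a single sample drawn from the merged stream, which independent per-stream sampling cannot guarantee.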

Patent
26 Mar 2001
TL;DR: In this article, a method and device for combining the sampling and analyzing of sub-dermal fluid samples, e.g., interstitial fluid or whole blood, in a device suitable for hospital bedside and home use is presented.
Abstract: The invention disclosed in this application is a method and device for combining the sampling and analyzing of sub-dermal fluid samples, e.g., interstitial fluid or whole blood, in a device suitable for hospital bedside and home use. It is applicable to any analyte that exists in a usefully representative concentration in the fluid, and is especially suited to the monitoring of glucose.

Journal ArticleDOI
TL;DR: Decades of strong correlation between ENSO and average Indian rainfall alternating with decades of insignificant correlation could be due solely to stochastic processes; in fact, the specific ENSO-AIR relationship is significantly less variable on decadal timescales than should be expected from sampling variability alone.
Abstract: Running correlations between pairs of stochastic time series are typically characterized by low-frequency evolution. This simple result of sampling variability holds for climate time series but is not often recognized for being merely noise. As an example, this paper discusses the historical connection between El Niño–Southern Oscillation (ENSO) and average Indian rainfall (AIR). Decades of strong correlation (∼−0.8) alternate with decades of insignificant correlation, and it is shown that this decadal modulation could be due solely to stochastic processes. In fact, the specific relationship between ENSO and AIR is significantly less variable on decadal timescales than should be expected from sampling variability alone.
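
A minimal sketch (not the paper's analysis) of the sampling-variability point: even for two series with a fixed underlying correlation, running correlations over a 21-year window wander substantially from decade to decade. The window length, record length, and correlation value are illustrative.

```python
import numpy as np

def running_corr(x, y, window=21):
    """Correlation of x and y in a sliding window of the given length."""
    out = []
    for i in range(len(x) - window + 1):
        out.append(np.corrcoef(x[i:i + window], y[i:i + window])[0, 1])
    return np.array(out)

rng = np.random.default_rng(42)
n_years, rho = 130, -0.6
z = rng.standard_normal(n_years)
x = z
y = rho * z + np.sqrt(1 - rho ** 2) * rng.standard_normal(n_years)

r = running_corr(x, y, window=21)
print(r.min(), r.max())   # decadal swings arise from sampling variability alone
```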