
Showing papers on "Bernoulli sampling" published in 2009


Proceedings ArticleDOI
29 Mar 2009
TL;DR: The experimental results show that the accuracy of the sketch computed over a small sample of the data is, in general, close to the accuracy of the sketch estimator computed over the entire data, even when the sample size is only 10% or less of the dataset size.
Abstract: Sampling is used as a universal method to reduce the running time of computations -- the computation is performed on a much smaller sample and then the result is scaled to compensate for the difference in size. Sketches are a popular approximation method for data streams and they have proved to be useful for estimating frequency moments and aggregates over joins. A possibility to further improve the time performance of sketches is to compute the sketch over a sample of the stream rather than the entire data stream. In this paper we analyze the behavior of the sketch estimator when computed over a sample of the stream, not the entire data stream, for the join size and self-join size problems. Our analysis is developed for a generic sampling process. We instantiate the results of the analysis for all three major types of sampling -- Bernoulli sampling, which is used for load shedding; sampling with replacement, which is used to generate i.i.d. samples from a distribution; and sampling without replacement, which is used by online aggregation engines -- and compare these particular results with the results of the basic sketch estimator. Our experimental results show that the accuracy of the sketch computed over a small sample of the data is, in general, close to the accuracy of the sketch estimator computed over the entire data even when the sample size is only 10% or less of the dataset size. This is equivalent to a speed-up factor of at least 10 when updating the sketch.

27 citations
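The sketch-over-a-sample idea in the abstract above can be illustrated in a few lines. The code below is a rough reconstruction, not the authors' implementation: the function names and the simple bias correction for Bernoulli thinning are assumptions, and a real AGMS sketch would use 4-wise independent hash functions rather than a cached random sign per key.

```python
import random

# Illustrative only: estimate the self-join size of a stream from an AGMS
# ("tug-of-war") sketch built over a Bernoulli sample of the stream.

def make_sign_hash(seed):
    rng, cache = random.Random(seed), {}
    def sign(key):
        if key not in cache:
            cache[key] = rng.choice((-1, 1))   # random +/-1 per distinct key
        return cache[key]
    return sign

def selfjoin_size_over_sample(stream, p, num_counters=64, seed=0):
    signs = [make_sign_hash(seed + j) for j in range(num_counters)]
    counters = [0] * num_counters
    n_sampled = 0
    rng = random.Random(seed + num_counters)
    for key in stream:
        if rng.random() < p:                   # Bernoulli sampling (load shedding)
            n_sampled += 1
            for j in range(num_counters):
                counters[j] += signs[j](key)   # sketch update on sampled tuples only
    # AGMS estimate of the self-join size of the *sampled* substream
    sj_sample = sum(c * c for c in counters) / num_counters
    # Scale back to the full stream: under Bernoulli(p) thinning,
    # E[g_i^2] = p^2 * f_i^2 + p*(1-p) * f_i, so subtract the thinning term
    # and divide by p^2 (a simple method-of-moments correction, assumed here).
    return (sj_sample - (1 - p) * n_sampled) / (p * p)
```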


Proceedings ArticleDOI
24 Mar 2009
TL;DR: This paper addresses the problem of estimating commonly occurring aggregates in time-constrained approximate SQL queries, giving point and interval estimates for SUM, COUNT, AVG, and MEDIAN under Bernoulli sampling and confidence bounds for MIN and MAX.
Abstract: The concept of time-constrained SQL queries was introduced to address the problem of long-running SQL queries. A key approach adopted for supporting time-constrained SQL queries is to use sampling to reduce the amount of data that needs to be processed, thereby allowing completion of the query within the specified time constraint. However, sampling makes the query results approximate and hence requires the system to estimate the values of the expressions (especially aggregates) occurring in the select list. Thus, coming up with estimates for aggregates is crucial for time-constrained approximate SQL queries to be useful, and this is the focus of this paper. Specifically, we address the problem of estimating commonly occurring aggregates (namely, SUM, COUNT, AVG, MEDIAN, MIN, and MAX) in time-constrained approximate queries. We give both point and interval estimates for SUM, COUNT, AVG, and MEDIAN using Bernoulli sampling for various types of queries, including join processing with cross-product sampling. For MIN (MAX), we give the confidence level that a proportion 100γ% of the population will exceed the MIN (or fall below the MAX) obtained from the sampled data.

15 citations
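For intuition, here is a minimal, textbook-style version of the Bernoulli-sampling scale-ups for COUNT and SUM with a normal-approximation interval for SUM. It is a sketch under standard assumptions, not the paper's estimators (which also cover AVG, MEDIAN, joins via cross-product sampling, and MIN/MAX bounds); the function names are hypothetical.

```python
import math
import random

# Minimal sketch: point and interval estimates for COUNT and SUM when each
# row is kept independently with probability q (Bernoulli sampling).

def bernoulli_sample(values, q, seed=42):
    rng = random.Random(seed)
    return [v for v in values if rng.random() < q]

def estimate_count_sum(sample, q, z=1.96):
    n = len(sample)
    count_hat = n / q                        # E[n] = q * N  =>  N_hat = n / q
    sum_hat = sum(sample) / q                # Horvitz-Thompson style scale-up
    # Var(sum_hat) = (1-q)/q * sum(v_i^2); estimate sum(v_i^2) by the sample sum of squares / q
    var_sum = (1 - q) / (q * q) * sum(v * v for v in sample)
    half = z * math.sqrt(var_sum)
    return count_hat, sum_hat, (sum_hat - half, sum_hat + half)

# AVG can be taken as sum_hat / count_hat (the plain sample mean); its interval
# requires a ratio-estimator variance, which is where a fuller treatment begins.
```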


Journal ArticleDOI
TL;DR: In this article, a modification to the variant of link-tracing sampling suggested by Felix-Medina and Thompson (2004) was proposed that allows the researcher some control over the final sample size, the precision of the estimates, or other characteristics of the sample.

8 citations


Journal ArticleDOI
TL;DR: Estimation of the population average in a finite population by means of a sampling strategy dependent on the sample quantile of an auxiliary variable, using the well-known Horvitz-Thompson estimator, is considered.
Abstract: Estimation of the population average in a finite population by means of a sampling strategy dependent on the sample quantile of an auxiliary variable is considered. The sampling design is proportional to the difference of two quantiles of an auxiliary variable. A sampling scheme implementing the sampling design is proposed. The derived inclusion probabilities are applied to estimating the population mean using the well-known Horvitz-Thompson estimator. Moreover, a regression estimator is defined as a function of a slope coefficient dependent on the quantiles of the auxiliary variable.

7 citations
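Whatever design produces the inclusion probabilities (here, one proportional to the difference of two quantiles of the auxiliary variable), the Horvitz-Thompson step itself is generic. A minimal sketch follows; the quantile-based design and the regression estimator are not reproduced, and the numbers are made up for illustration.

```python
# Horvitz-Thompson estimator of a finite-population mean from a sample with
# known (possibly unequal) inclusion probabilities pi_i.

def horvitz_thompson_mean(sample_values, inclusion_probs, population_size):
    # t_HT = sum over the sample of y_i / pi_i ;  mean_hat = t_HT / N
    total_hat = sum(y / pi for y, pi in zip(sample_values, inclusion_probs))
    return total_hat / population_size

# Hypothetical example: two sampled units from a population of N = 5
print(horvitz_thompson_mean([10.0, 40.0], [0.5, 0.8], population_size=5))
```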


Journal ArticleDOI
TL;DR: Repeated Poisson sampling (RP), as discussed by the authors, is a new sampling technique for selecting a sample of fixed size with unequal inclusion probabilities, and it is very similar to conditional Poisson sample selection.

6 citations
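The description suggests a fixed-size relative of Poisson sampling. One common way to realize conditional Poisson sampling is rejective selection: redraw the Poisson sample until it hits the target size. The sketch below shows that idea only; it is not necessarily the RP procedure the authors propose, and the working probabilities are illustrative.

```python
import random

# Rejective (conditional Poisson) selection: draw independent Bernoulli trials
# with unequal working probabilities and keep the first draw whose size equals
# the target n. Illustrative only.

def rejective_poisson_sample(working_probs, n, seed=1, max_tries=100000):
    rng = random.Random(seed)
    for _ in range(max_tries):
        sample = [i for i, p in enumerate(working_probs) if rng.random() < p]
        if len(sample) == n:
            return sample
    raise RuntimeError("no draw of the target size; check n against sum(working_probs)")

# Note: conditioning on the size changes the realized inclusion probabilities,
# which is why calibrated working probabilities are needed in practice.
print(rejective_poisson_sample([0.2, 0.5, 0.7, 0.6], n=2))
```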


01 Jan 2009
TL;DR: This chapter gives the reader a brief introduction to probability and the basis of sampling in probability, covers basic sampling methods with a focus on probability sampling, and illustrates these sampling methods with numerical examples that the reader can replicate.
Abstract: Probability theory concerns the relative frequency with which certain events occur. Probability is important in sampling because it is the vehicle that allows the researcher to use the information in a sample to make inferences about the population from which the sample was obtained. The purpose of this chapter is to give an overview of probability and sampling. The goals of this chapter are (a) to give the reader a brief introduction to probability and the basis of sampling in probability, (b) to cover basic sampling methods with a focus on probability sampling, and (c) to illustrate these sampling methods by use of numerical examples that the reader can replicate.

3 citations


01 Jan 2009
TL;DR: In this article, the authors provide the theory for estimating the variance of the difference in two years' domain-level totals under the stratified Bernoulli sample design and apply it to data from the Statistics of Income Division's individual income tax return sample.
Abstract: This paper provides the theory for estimating the variance of the difference in two years' domain-level totals under the stratified Bernoulli sample design. Henry et al. (2008) developed an approximately design-unbiased variance estimator that used poststratification to correct for the random sample sizes created under Bernoulli sampling. We modify the Henry et al. (2008) variance estimator for the estimated change in domain-level totals. We consider both "planned domains," domains that are related to the sample design variables, and "analysis domains," unplanned domains of interest at the analysis stage. Our variance estimator takes into account three practical problems: a large overlap of units between two years' samples, changing compositions of units across years that produce "stratum jumpers," which are population and sample units that shift across strata from one year to another (Rivest, 1999), and changes in sampling rates across years. These problems affect estimating the covariance term in the variance of the difference. The variance estimator is applied to data from the Statistics of Income Division's individual income tax return sample. Naive variance estimates using only the separate years' variances are compared to show the effect of ignoring the estimated covariance.
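The covariance issue this abstract highlights can be made concrete with a toy calculation; the numbers and function names below are hypothetical and do not come from the SOI application.

```python
# Variance of an estimated change between two years' totals:
#   Var(t2_hat - t1_hat) = Var(t1_hat) + Var(t2_hat) - 2 * Cov(t1_hat, t2_hat)
# With heavily overlapping Bernoulli samples the covariance is far from zero,
# so a "naive" estimator that just adds the yearly variances is biased.

def variance_of_change(var_y1, var_y2, cov_y1_y2):
    return var_y1 + var_y2 - 2.0 * cov_y1_y2

def naive_variance_of_change(var_y1, var_y2):
    return var_y1 + var_y2            # ignores the between-year covariance

# Hypothetical numbers: a strong positive covariance roughly halves the variance
print(variance_of_change(4.0e6, 5.0e6, 2.0e6))    # 5000000.0
print(naive_variance_of_change(4.0e6, 5.0e6))     # 9000000.0
```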