Efficient balanced sampling: The cube method

Jean-Claude Deville, +1 more
- 01 Dec 2004 - 
- Vol. 91, Iss: 4, pp 893-912
TLDR
The cube method selects approximately balanced samples with equal or unequal inclusion probabilities and any number of auxiliary variables; the resulting variance reduction depends on the correlations of the variables of interest with the balancing variables.
Abstract
A balanced sampling design is defined by the property that the Horvitz-Thompson estimators of the population totals of a set of auxiliary variables equal the known totals of these variables. Therefore the variances of estimators of totals of all the variables of interest are reduced, depending on the correlations of these variables with the controlled variables. In this paper, we develop a general method, called the cube method, for selecting approximately balanced samples with equal or unequal inclusion probabilities and any number of auxiliary variables.


Biometrika (2004), 91, 4, pp. 893–912
© 2004 Biometrika Trust
Printed in Great Britain
Efficient balanced sampling: The cube method
By JEAN-CLAUDE DEVILLE
Laboratoire de Statistique d’Enquête, CREST–ENSAI,
École Nationale de la Statistique et de l’Analyse de l’Information, rue Blaise Pascal,
Campus de Ker Lann, 35170 Bruz, France
deville@ensai.fr
AND YVES TILLÉ
Groupe de Statistique, Université de Neuchâtel, Espace de l’Europe 4, Case postale 805,
2002 Neuchâtel, Switzerland
yves.tille@unine.ch
SUMMARY
A balanced sampling design is defined by the property that the Horvitz–Thompson
estimators of the population totals of a set of auxiliary variables equal the known totals
of these variables. Therefore the variances of estimators of totals of all the variables of
interest are reduced, depending on the correlations of these variables with the controlled
variables. In this paper, we develop a general method, called the cube method, for selecting
approximately balanced samples with equal or unequal inclusion probabilities and any
number of auxiliary variables.
Some key words: Calibration; Poststratification; Quota sampling; Sampling algorithm; Stratification; Sunter’s
method; Unequal selection probabilities.
1. INTRODUCTION
The use of auxiliary information is a central issue in survey sampling from finite
populations. The classical techniques that use auxiliary information in a sampling design
are stratification (Neyman, 1934; Tschuprow, 1923) and unequal probability sampling or
sampling proportional to size (Hansen & Hurwitz, 1943; Madow, 1949).
The problem of balanced sampling is an old one and has not yet been solved. Kiaer
(1896), founder of modern sampling, argued for samples that match the means of known
variables to obtain what he called ‘representative samples’. He advocated purposive
methods before the development of the idea of probability sampling proposed by Neyman
(1934, 1938). Yates (1949) also insisted on the idea of respecting the means of known
variables in probability samples because the variance is then reduced. Yates (1946) and
Thionet (1953, pp. 203–7) have described limited and heavy methods of balanced sampling.
Hájek (1964; 1981, p. 157) gives a rigorous definition of a representative strategy and
its properties. According to Hájek, a strategy is a pair composed of a sampling design
and an estimator, the strategy being representative if it estimates exactly the total of
an auxiliary variable. He showed that a representative strategy could be achieved by
regression, but he did not succeed in finding a representative sampling method associated
with the Horvitz–Thompson estimator other than the rejective procedure, which consists
of selecting new samples until a balanced sample is found. Royall & Herson (1973)
stressed the importance of balancing a sample in order to protect the inference against
a misspecified model. They called this idea ‘robustness’. Since no method existed for
achieving a multivariate balanced sample, they proposed the use of simple random
sampling, which is ‘mean-balanced’ with large samples. Several partial solutions were
proposed by Deville et al. (1988), Deville (1992), Ardilly (1991) and Hedayat & Majumdar
(1995), but a general solution for balanced sampling was never found. Recently, Valliant
et al. (2000) surveyed some existing methods.
In this paper, we propose a general method, the cube method, that allows the selection
of approximately balanced samples, in that the Horvitz–Thompson estimates for the
auxiliary variables are equal, or nearly equal, to their population totals. The method is
appropriate for a large set of qualitative or quantitative balancing variables, it allows
unequal inclusion probabilities, and it permits us to understand how accurately a
sample can be balanced. Moreover, the sampling design respects any fixed, equal or
unequal, inclusion probabilities. The method can be viewed as a generalisation of the
splitting procedure (Deville & Tillé, 1998) which allows easy construction of new unequal
probability sampling methods.
Since its conception, the cube method has aroused great interest amongst survey
statisticians at the Institut National de la Statistique et des Études Économiques (INSEE),
the French Bureau of Statistics. A first application of the method was implemented in
SAS-IML by A. Bousabaa, J. Lieber, R. Sirolli and F. Tardieu. This macro allows the
selection of samples with unequal probabilities of up to 50 000 units and 30 balancing
variables. The INSEE has adopted the cube method for its most important statistical
projects. In the redesigned census in France, a fifth of the municipalities with fewer than
5000 inhabitants are sampled each year, so that after five years all the municipalities will
be selected. All the households in these municipalities are surveyed. The five samples of
municipalities are selected with equal probabilities using the cube method and are balanced
on a set of demographic variables (Dumais & Isnard, 2000).
The demand for such sampling methods is very strong. In the French National Statistical
Institute, the use of balanced sampling in several projects improved efficiency dramatically,
allowing a reduction of the variance by 20 to 90% in comparison to simple random
sampling.
2. F   ,  
Consider a finite population $U$ of size $N$ whose units can be identified by labels
$k \in \{1, \ldots, N\}$. The aim is to estimate the total $Y = \sum_{k \in U} y_k$ of a variable of interest $y$
that takes the values $y_k$ $(k \in U)$ for the units of the population. Suppose also that the
vectors of values $x_k = (x_{k1} \ldots x_{kj} \ldots x_{kp})$ taken by $p$ auxiliary variables are known for all
the units of the population. The $p$ vectors $(x_{1j} \ldots x_{kj} \ldots x_{Nj})$, for $j = 1, \ldots, p$, are assumed
without loss of generality to be linearly independent.
A sample is denoted by a vector $s = (s_1, \ldots, s_k, \ldots, s_N)$, where $s_k$ takes the value 1
if $k$ is in the sample and is 0 otherwise. A sampling design $p(\cdot)$ is a probability distribution
on the set $\mathcal{S} = \{0, 1\}^N$ of all the possible samples. The random sample $S$ takes
the value $s$ with probability $\mathrm{pr}(S = s) = p(s)$. The inclusion probability of unit $k$ is the
probability $\pi_k = \mathrm{pr}(S_k = 1)$ that unit $k$ is in the sample and the joint inclusion probability
is the probability $\pi_{kl} = \mathrm{pr}(S_k = 1 \text{ and } S_l = 1)$ that two distinct units are jointly in the
sample. The Horvitz–Thompson estimator given by $\hat{Y} = \sum_{k \in U} S_k y_k / \pi_k$ is an unbiased estimator
of $Y$. The Horvitz–Thompson estimator of the $j$th auxiliary total $X_j = \sum_{k \in U} x_{kj}$
is $\hat{X}_j = \sum_{k \in U} S_k x_{kj} / \pi_k$. The Horvitz–Thompson estimator vector, $\hat{X} = \sum_{k \in U} S_k x_k / \pi_k$,
estimates without bias the totals of the auxiliary variables, $X = \sum_{k \in U} x_k$.
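The unbiasedness of the Horvitz–Thompson estimator can be checked by exhaustive enumeration on a toy population. The sketch below is only an illustration: the population values, the choice of simple random sampling and the helper name `horvitz_thompson` are assumptions made here and do not come from the paper.

```python
from itertools import combinations
import numpy as np

def horvitz_thompson(s, y, pi):
    """Horvitz-Thompson estimate: sum over k of s_k * y_k / pi_k."""
    s, y, pi = (np.asarray(a, dtype=float) for a in (s, y, pi))
    return float(np.sum(s * y / pi))

# Hypothetical population of N = 4 units (values chosen for illustration only).
y = np.array([3.0, 5.0, 8.0, 10.0])
N, n = len(y), 2
pi = np.full(N, n / N)            # simple random sampling: pi_k = n / N

# Enumerate every sample of size n; each has probability 1 / C(N, n).
samples = []
for units in combinations(range(N), n):
    s = np.zeros(N)
    s[list(units)] = 1.0
    samples.append(s)
p_s = 1.0 / len(samples)

# Expectation of the estimator over the design, to be compared with y.sum().
expected = sum(p_s * horvitz_thompson(s, y, pi) for s in samples)
```

Averaged over the six possible samples, the estimates reproduce the true total, although individual samples do not.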
The aim is to construct a balanced sampling design, defined as follows.
D 1. A sampling design p(s) is said to be balanced on the auxiliary variables,
x
1
,...,x
p
, if and only if it satisfies the balancing equations given by
X
C
=X, (1)
which can also be written as
kµU
s
k
x
kj
p
k
=∑
kµU
x
kj
,
for all sµS such that p(s)>0.
Remark. If the $y_k$ are linear combinations of the $x_k$, that is $y_k = x_k' b$ for all $k$, where $b$
is a vector of constants, then $\hat{Y} = Y$. More generally, if the $y_k$ are well predicted by a linear
combination of the $x_k$, one can expect $\mathrm{var}(\hat{Y})$ to be small.
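The remark can be illustrated numerically. In the sketch below, the population, the coefficient vector `b` and the helper `is_balanced` are hypothetical choices made for illustration. One sample satisfies the balancing equations for the two variables $\pi_k$ and $x_k$, and the Horvitz–Thompson estimate of a variable that is an exact linear combination of them then equals its total.

```python
import numpy as np

def is_balanced(s, X, pi, tol=1e-9):
    """True if sum_k s_k x_k / pi_k equals sum_k x_k for every column of X."""
    ht = (np.asarray(s, dtype=float) / pi) @ X
    return bool(np.allclose(ht, X.sum(axis=0), atol=tol))

# Hypothetical population: N = 4, equal inclusion probabilities 1/2.
pi = np.full(4, 0.5)
x = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([pi, x])        # balance on pi_k (fixed size) and on x_k

# A variable of interest that is an exact linear combination of the columns of X.
b = np.array([4.0, 3.0])            # assumed coefficients
y = X @ b                           # y_k = 4 * pi_k + 3 * x_k
Y = y.sum()

s_bal = np.array([1.0, 0.0, 0.0, 1.0])     # units 1 and 4
s_unbal = np.array([1.0, 1.0, 0.0, 0.0])   # units 1 and 2

ht_bal = float(np.sum(s_bal * y / pi))     # estimate on the balanced sample
ht_unbal = float(np.sum(s_unbal * y / pi)) # estimate on the unbalanced sample
```

The unbalanced sample of the same size misses the total, which is the source of variance that balancing removes.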
Next consider the following three particular cases of balanced sampling.
Example 1. A sampling design of fixed sample size $n$ is balanced on the variable
$x_k = \pi_k$ $(k \in U)$ because
$$\sum_{k \in U} \frac{S_k x_k}{\pi_k} = \sum_{k \in U} S_k = n.$$
Example 2. Suppose that the design is stratified and that, from each stratum $U_h$
$(h = 1, \ldots, H)$ of size $N_h$, a simple random sample of size $n_h$ is selected. Then the design
is balanced on the variables $\delta_{kh}$, where
$$\delta_{kh} = \begin{cases} 1, & \text{if } k \in U_h, \\ 0, & \text{if } k \notin U_h. \end{cases}$$
In this case, we have
$$\sum_{k \in U} \frac{S_k \delta_{kh}}{\pi_k} = \sum_{k \in U} S_k \delta_{kh} \frac{N_h}{n_h} = N_h \quad (h = 1, \ldots, H).$$
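As a quick check of Example 2, the following sketch draws one stratified simple random sample and computes the Horvitz–Thompson estimates of the stratum sizes. The stratum sizes, sample sizes and random seed are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N_h = np.array([6, 4])              # hypothetical stratum sizes
n_h = np.array([3, 2])              # hypothetical stratum sample sizes
stratum = np.repeat(np.arange(len(N_h)), N_h)   # stratum label of each unit
pi = (n_h / N_h)[stratum]           # pi_k = n_h / N_h inside stratum h

# Indicator variables delta_kh as an N x H matrix.
delta = (stratum[:, None] == np.arange(len(N_h))[None, :]).astype(float)

# Draw one stratified simple random sample.
s = np.zeros(len(stratum))
for h in range(len(N_h)):
    units = np.flatnonzero(stratum == h)
    s[rng.choice(units, size=n_h[h], replace=False)] = 1.0

# Horvitz-Thompson estimates of the stratum sizes N_h.
ht_counts = (s / pi) @ delta
```

Whatever units are drawn, `ht_counts` reproduces `N_h`, so every sample with positive probability satisfies the balancing equations.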
Example 3. In sampling with unequal probabilities, when all the inclusion probabilities
are different, the Horvitz–Thompson estimator $\hat{N} = \sum_{k \in U} S_k / \pi_k$ of the population size $N$
is generally random. When the population size is known before selecting the sample, it
could be important to select a sample such that
$$\sum_{k \in U} \frac{S_k}{\pi_k} = N. \tag{2}$$
Equation (2) is a balancing equation, in which the balancing variable is $x_k = 1$ $(k \in U)$.
Until now, there has been no method by which (2) can be approximately satisfied for
arbitrary inclusion probabilities, but we will see that this balancing equation can be
satisfied by means of the cube method.
Stratification and unequal probability sampling are thus special cases of balanced
sampling. In § 6, we present new cases, but the main practical interest of balanced sampling
lies in its generality. Nevertheless, in most cases, the balancing equations (1) cannot be
exactly satisfied, as the following example shows.
Example 4. Suppose that $N = 10$, $n = 7$, $\pi_k = 7/10$ $(k \in U)$ and that the only auxiliary
variable is $x_k = k$ $(k \in U)$. Then a balanced sample satisfies
$$\sum_{k \in U} \frac{S_k k}{\pi_k} = \sum_{k \in U} k,$$
so that $\sum_{k \in U} k S_k$ has to be equal to $55 \times 7/10 = 38{\cdot}5$, which is impossible because 38·5 is not
an integer. The problem arises because $1/\pi_k$ is not an integer and the population size
is small.
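The arithmetic of Example 4 can be verified by enumerating the 120 samples of size 7. The sketch below uses exact fractions to avoid rounding; the variable names are illustrative.

```python
from itertools import combinations
from fractions import Fraction

N, n = 10, 7
pi = Fraction(n, N)
target = sum(range(1, N + 1)) * pi      # 55 * 7/10, the required value of sum_k k S_k

# Every value that sum_k k S_k can take over the samples of size n.
totals = {sum(units) for units in combinations(range(1, N + 1), n)}

exactly_balanced = target in totals
# The two achievable totals nearest to the target.
closest = sorted(totals, key=lambda t: abs(t - target))[:2]
```

No sample attains the target, and the nearest achievable totals lie half a unit on either side of it, which is the rounding problem described above.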
Consequently, our objective is to construct a sampling design which satisfies the
balancing equations (1) exactly if possible, and to find the best approximation if this
cannot be achieved. The rounding problem becomes negligible when the expected sample
size is large.
3. CUBE REPRESENTATION OF BALANCED SAMPLES
The cube method is based on a geometric representation of the sampling design. The $2^N$
possible samples correspond to $2^N$ vectors of $\mathbb{R}^N$ in the following way. Each vector $s$ is a
vertex of an $N$-cube, and the number of possible samples is the number of vertices of an
$N$-cube. A sampling design with inclusion probabilities $\pi_k$ $(k \in U)$ consists of assigning a
probability $p(s)$ to each vertex of the $N$-cube such that
$$E(s) = \sum_{s \in \mathcal{S}} p(s)\, s = \pi,$$
where $\pi = (\pi_k)$ is the vector of inclusion probabilities. Geometrically, a sampling design
consists of expressing the vector $\pi$ as a convex combination of the vertices of the $N$-cube.
A sampling algorithm can thus be viewed as a ‘random’ way of reaching a vertex of the
$N$-cube from a vector $\pi$ in such a way that the balancing equations (1) are satisfied.
Figure 1 shows the geometric representation of the possible samples from a population
of size $N = 3$.
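The convex-combination view can be made concrete for $N = 3$. In the sketch below, the design (equal probability on the three samples of size 2) is an assumption chosen for illustration. The code lists the $2^N$ vertices of the cube and recovers the inclusion probabilities as the expectation of $s$.

```python
import numpy as np
from itertools import product

N = 3
# The 2^N vertices of the N-cube, one row per possible sample.
vertices = np.array(list(product([0, 1], repeat=N)), dtype=float)

# Hypothetical design: probability 1/3 on each sample of size 2, zero elsewhere.
p = np.where(vertices.sum(axis=1) == 2, 1.0 / 3.0, 0.0)

# E(s) = sum over s of p(s) * s, a convex combination of the vertices.
pi = p @ vertices
```

Here `pi` is the point (2/3, 2/3, 2/3) inside the cube, written as a weighted average of the three vertices that carry positive probability.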
Fig. 1. Geometric representation of possible samples in a population of size $N = 3$.
The cube method is composed of two phases called the flight phase and the landing
phase. In the flight phase, the constraints are exactly satisfied. The objective is to round
off randomly to 0 or 1 almost all the inclusion probabilities. The landing phase consists
of coping as well as possible with the fact that the balancing equations (1) cannot always
be satisfied exactly.
The balancing equations (1) can also be written
$$\sum_{k \in U} a_k s_k = \sum_{k \in U} a_k \pi_k, \quad s_k \in \{0, 1\}, \quad k \in U, \tag{3}$$
where $a_k = x_k / \pi_k$ $(k \in U)$ and $s_k$ equals 1 if unit $k$ is in the sample and 0 otherwise. The
first equation of (3) with given $a_k$ and coordinates $s_k$ defines a hyperplane $Q$ in $\mathbb{R}^N$ of
dimension $N - p$. Note that $Q = \pi + \ker A$, where $\ker A$ is the kernel or null-space of the
$p \times N$ matrix $A$ given by $A = (a_1 \ldots a_k \ldots a_N)$. The main idea in obtaining a balanced
sample is to choose a vertex of the $N$-cube that remains in the hyperplane $Q$ or near to $Q$
if that is not possible.
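The identity $Q = \pi + \ker A$ says that moving from $\pi$ along any vector of the null space of $A$ leaves the balancing equations intact. The sketch below illustrates this with a small example; the data, the step length 0.05 and the helper `null_space_basis` are assumptions made for illustration.

```python
import numpy as np

def null_space_basis(A, tol=1e-10):
    """Orthonormal basis of ker A, taken from the SVD (columns of the result)."""
    _, sing, vt = np.linalg.svd(A)
    rank = int(np.sum(sing > tol))
    return vt[rank:].T

# Hypothetical example: N = 5 units, p = 2 balancing variables (pi_k and x_k).
pi = np.array([0.2, 0.4, 0.6, 0.8, 0.5])
X = np.column_stack([pi, np.array([1.0, 2.0, 3.0, 4.0, 5.0])])
A = (X / pi[:, None]).T             # p x N matrix with columns a_k = x_k / pi_k

basis = null_space_basis(A)         # N x (N - p) matrix
u = basis[:, 0]                     # one direction in ker A
r = pi + 0.05 * u                   # another point of Q, still inside the cube
```

Since `A @ u` vanishes, `A @ r` equals `A @ pi`, so `r` satisfies the first equation of (3) even though it is not a vertex of the cube.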
If $C = [0, 1]^N$ denotes the $N$-cube in $\mathbb{R}^N$ whose vertices are the samples of $U$, the
intersection between $C$ and $Q$ is nonempty, because $\pi$ is in the interior of $C$ and belongs
to $Q$. The intersection between an $N$-cube and a hyperplane defines a polytope $K = C \cap Q$,
which is of dimension $(N - p)$ because it is the intersection of an $N$-cube and a plane, of
dimension $(N - p)$, that has a point in the interior of $C$.
D 2. L et D be a convex polyhedron. A vertex, or extremal point, of D is defined
as a point that cannot be expressed as a convex linear combination of other points of D. T he
set of all the vertices of D is denoted by Ext (D).
D 3. A sample s is said to be exactly balanced if sµExt (C)mQ.
Note that a necessary condition for finding an exactly balanced sample is that
Ext (C)mQNB.
D 4. A balancing equation system is
(i) exactly satisfied if Ext (C)mQ=Ext (CmQ),
(ii) approximately satisfied if Ext (C)mQ=B,
(iii) sometimes satisfied if Ext (C)mQNExt (CmQ) and Ext (C)mQNB.
Whether the balancing equation system is exactly satisfied, approximately satisfied or
sometimes satisfied depends on the values of p and A.
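For very small populations, the set $\mathrm{Ext}(C) \cap Q$ of exactly balanced samples can be listed by brute force. The sketch below is only feasible for small $N$, and the function name `exactly_balanced_samples` is hypothetical. It detects case (ii) of Definition 4, but it cannot separate case (i) from case (iii), since that would also require the vertices of $C \cap Q$.

```python
import numpy as np
from itertools import product

def exactly_balanced_samples(X, pi, tol=1e-9):
    """Return every vertex s of the N-cube with sum_k a_k s_k = sum_k a_k pi_k."""
    A = (X / pi[:, None]).T             # columns a_k = x_k / pi_k
    target = A @ pi
    found = []
    for s in product([0.0, 1.0], repeat=len(pi)):
        s = np.array(s)
        if np.allclose(A @ s, target, atol=tol):
            found.append(s)
    return found

# Fixed sample size n = 2 in a population of size N = 3, with x_k = pi_k.
pi3 = np.full(3, 2.0 / 3.0)
size_two = exactly_balanced_samples(pi3[:, None], pi3)

# N = 10, pi_k = 7/10 and x_k = k: no vertex of the cube lies in Q.
pi10 = np.full(10, 0.7)
x10 = np.arange(1.0, 11.0)[:, None]
none_found = exactly_balanced_samples(x10, pi10)
```

The first call returns the three samples of size 2; the second returns an empty list, so that system can at best be approximately satisfied.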
P 1. If r =(r
k
) is a vertex of K then #{k|0<r
k
<1}p, where p is the number
of auxiliary variables, and #(B) denotes the cardinality of a set B.
Proof. Let $U^* = \{k \mid 0 < r_k < 1\}$ and let $A^*$ be the submatrix of $A$ consisting of the columns
corresponding to noninteger components of the vector $r$. If $q = \#(U^*) > p$, then $\ker A^*$ has dimension $q - p > 0$,
and $r$ is not an extreme point of $K$. □
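The argument of the proof can be turned into a small numerical illustration: while more than $p$ components are noninteger, a nonzero vector of $\ker A^*$ gives a direction along which the point can move inside $K$ until one more component reaches 0 or 1. The sketch below is not a statement of the authors' algorithm. The random choice between the two directions is an assumed rule, chosen so that the expected move is zero, and the data and function names are hypothetical.

```python
import numpy as np

def max_step(r, u, tol=1e-9):
    """Largest lam >= 0 such that every component of r + lam * u stays in [0, 1]."""
    pos, neg = u > tol, u < -tol
    return np.concatenate([(1 - r[pos]) / u[pos], -r[neg] / u[neg]]).min()

def reduce_noninteger(r, A, rng, tol=1e-9):
    """Move r within K until at most p = A.shape[0] components are noninteger."""
    r = np.asarray(r, dtype=float).copy()
    p = A.shape[0]
    while True:
        free = np.flatnonzero((r > tol) & (r < 1 - tol))   # noninteger components
        if len(free) <= p:
            return r
        # With q > p columns, the last right-singular vector of A* lies in ker A*.
        u = np.zeros_like(r)
        u[free] = np.linalg.svd(A[:, free])[2][-1]
        lam1, lam2 = max_step(r, u, tol), max_step(r, -u, tol)
        # Assumed rule: choose the direction so that the expected move is zero.
        if rng.random() < lam2 / (lam1 + lam2):
            r = r + lam1 * u
        else:
            r = r - lam2 * u
        # Snap components that reached the boundary, removing rounding noise.
        r[np.abs(r) < tol] = 0.0
        r[np.abs(r - 1) < tol] = 1.0

# Hypothetical example: N = 6 units, p = 2 balancing variables.
pi = np.array([0.2, 0.4, 0.6, 0.8, 0.5, 0.5])
X = np.column_stack([pi, np.arange(1.0, 7.0)])
A = (X / pi[:, None]).T
r = reduce_noninteger(pi, A, np.random.default_rng(1))
n_free = int(np.sum((r > 1e-9) & (r < 1 - 1e-9)))
```

Each pass fixes at least one more component at 0 or 1 while keeping `A @ r` equal to `A @ pi`, so the loop stops with at most $p$ noninteger components, in line with Proposition 1.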
The following three examples show that the rounding problem can be viewed geometrically.
Indeed, the balancing equations cannot be exactly satisfied when the vertices
of $K$ are not vertices of $C$, that is when $q > 0$.
Example 5. In Fig. 2(a), a sampling design for a population of size $N = 3$ is considered.
The only constraint consists of fixing the sample size $n = 2$, and thus $p = 1$ and $x_k = \pi_k$
$(k \in U)$. The inclusion probabilities satisfy $\pi_1 + \pi_2 + \pi_3 = 2$, so that the balancing equation
is exactly satisfied.

References
Model assisted survey sampling.
Neyman, J. On the Two Different Aspects of the Representative Method: the Method of Stratified Sampling and the Method of Purposive Selection.