What is the strategy for using balanced sampling?

Balanced sampling protects against extreme or negative weights, which, as mentioned before, can be very problematic, particularly with small samples.

What is the second way to sort the data?

The second step consists of taking $v(1) = (0\ 1\ 0\ \ldots\ 0)'$. The second way consists of sorting the data randomly before applying the cube method with any vectors $v(t)$.

What is the balancing variable in equation 2?

When the population size is known before selecting the sample, it could be important to select a sample such that

$$\sum_{k \in U} \frac{S_k}{\pi_k} = N. \tag{2}$$

Equation (2) is a balancing equation, in which the balancing variable is $x_k = 1$ ($k \in U$).

How can the authors implement the cube method?

Nearly all existing methods, except the rejective ones and the variations of systematic sampling, can easily be implemented by means of the cube method.

What is the general method for detecting when the balancing equations are exactly satisfied?

At the end of the flight phase, a vertex of $K$ is chosen randomly in such a way that the inclusion probabilities $\pi_k$ ($k \in U$) and the balancing equations (1) are exactly satisfied.

What is the variance approximation for balanced sampling?

A variance approximation is proposed for balanced sampling based on regression residuals, which is validated by a theoretical development and a large set of simulations.

What is the way to solve the problem of sampling with unequal probability?

In order to satisfy this constraint, expression (5) implies that

$$\sum_{k \in U} u_k(t) = 0. \tag{13}$$

Each choice, random or not, of vectors $u(t)$ that satisfy (13) produces another method for sampling with unequal probability.

What is the first step in the generating of the vector u(t)?

For generating the vector $u(t)$, the authors first generate any, random or not, vector $v(t) = \{v_k(t)\}$ in $\mathbb{R}^N$ that is independent of $\pi(t-1), \ldots, \pi(1)$.

What is the standard probability weighted estimator?

The calibration estimator is defined as

$$\hat{Y}_R = \hat{Y} + (X - \hat{X})' b,$$

where

$$b = \left( \sum_{k \in U} \frac{s_k x_k x_k'}{\pi_k} \right)^{-1} \sum_{k \in U} \frac{s_k x_k y_k}{\pi_k}$$

is the ‘standard’ probability weighted estimator.

How many auxiliary variables can be used to calculate the variance?

With some adjustments, the cube method can thus be applied to any sampling frame, even with millions of units and a large number of auxiliary variables.

(Open Access) Efficient balanced sampling: The cube method (2004) | Jean-Claude Deville

Q: What have the authors contributed in "Efficient balanced sampling: the cube method" ?

In this paper, the authors develop a general method, called the cube method, for selecting approximately balanced samples with equal or unequal inclusion probabilities and any number of auxiliary variables.

Biometrika (2004), 91, 4, pp. 893–912

Printed in Great Britain

Eﬃcient balanced sampling: The cube method

B JEAN-CLAUDE DEVILLE

L aboratoire de Statistique d’Enque

te, CREST –ENSAI,

cole Nationale de la Statistique et de l’Analyse de l’Information, rue Blaise Pascal,

Campus de Ker L ann, 35170 Bruz, France

deville@ensai.fr

 YVES TILLE

Groupe de Statistique, Universite

de Neucha

tel, Espace de l’Europe 4, Case postale 805,

2002 Neucha

tel, Switzerland

yves.tille@unine.ch

SUMMARY

A balanced sampling design is deﬁned by the property that the Horvitz–Thompson

estimators of the population totals of a set of auxiliary variables equal the known totals

of these variables. Therefore the variances of estimators of totals of all the variables of

interest are reduced, depending on the correlations of these variables with the controlled

variables. In this paper, we develop a general method, called the cube method, for selecting

approximately balanced samples with equal or unequal inclusion probabilities and any

number of auxiliary variables.

Some key words: Calibration; Poststratiﬁcation; Quota sampling; Sampling algorithm; Stratiﬁcation; Sunter’s

method; Unequal selection probabilities.

1. INTRODUCTION

The use of auxiliary information is a central issue in survey sampling from ﬁnite

populations. The classical techniques that use auxiliary information in a sampling design

are stratiﬁcation (Neyman, 1934; Tschuprow, 1923) and unequal probability sampling or

sampling proportional to size (Hansen & Hurwitz, 1943; Madow, 1949).

The problem of balanced sampling is an old one and has not yet been solved. Kiaer

(1896), founder of modern sampling, argued for samples that match the means of known

variables to obtain what he called ‘representative samples’. He advocated purposive

methods before the development of the idea of probability sampling proposed by Neyman

(1934, 1938). Yates (1949) also insisted on the idea of respecting the means of known

variables in probability samples because the variance is then reduced. Yates (1946) and

Thionet (1953, pp. 203–7) have described limited and heavy methods of balanced sampling.

Hájek (1964; 1981, p. 157) gives a rigorous definition of a representative strategy and

its properties. According to Hájek, a strategy is a pair composed of a sampling design

and an estimator, the strategy being representative if it estimates exactly the total of

an auxiliary variable. He showed that a representative strategy could be achieved by

regression, but he did not succeed in ﬁnding a representative sampling method associated

894 J-C D  Y T



with the Horvitz–Thompson estimator other than the rejective procedure, which consists

of selecting new samples until a balanced sample is found. Royall & Herson (1973)

stressed the importance of balancing a sample in order to protect the inference against

a misspeciﬁed model. They called this idea ‘robustness’. Since no method existed for

achieving a multivariate balanced sample, they proposed the use of simple random

sampling, which is ‘mean-balanced’ with large samples. Several partial solutions were

proposed by Deville et al. (1988), Deville (1992), Ardilly (1991) and Hedayat & Majumdar

(1995), but a general solution for balanced sampling was never found. Recently, Valliant

et al. (2000) surveyed some existing methods.

In this paper, we propose a general method, the cube method, that allows the selection

of approximately balanced samples, in that the Horvitz–Thompson estimates for the

auxiliary variables are equal, or nearly equal, to their population totals. The method is

appropriate for a large set of qualitative or quantitative balancing variables, it allows

unequal inclusion probabilities, and it permits us to understand how accurately a

sample can be balanced. Moreover, the sampling design respects any ﬁxed, equal or

unequal, inclusion probabilities. The method can be viewed as a generalisation of the

splitting procedure (Deville & Tillé, 1998), which allows easy construction of new unequal

probability sampling methods.

Since its conception, the cube method has aroused great interest amongst survey

statisticians at the Institut National de la Statistique et des Études Économiques (INSEE),

the French Bureau of Statistics. A ﬁrst application of the method was implemented in

SAS-IML by A. Bousabaa, J. Lieber, R. Sirolli and F. Tardieu. This macro allows the

selection of samples with unequal probabilities of up to 50 000 units and 30 balancing

variables. The INSEE has adopted the cube method for its most important statistical

projects. In the redesigned census in France, a ﬁfth of the municipalities with fewer than

5000 inhabitants are sampled each year, so that after ﬁve years all the municipalities will

be selected. All the households in these municipalities are surveyed. The ﬁve samples of

municipalities are selected with equal probabilities using the cube method and are balanced

on a set of demographic variables (Dumais & Isnard, 2000).

The demand for such sampling methods is very strong. In the French National Statistical

Institute, the use of balanced sampling in several projects improved eﬃciency dramatically,

allowing a reduction of the variance by 20 to 90% in comparison to simple random

sampling.

2. F   ,  

Consider a finite population $U$ of size $N$ whose units can be identified by labels $k \in \{1, \ldots, N\}$. The aim is to estimate the total $Y = \sum_{k \in U} y_k$ of a variable of interest $y$ that takes the values $y_k$ ($k \in U$) for the units of the population. Suppose also that the vectors of values $x_k = (x_{k1} \ldots x_{kj} \ldots x_{kp})'$ taken by $p$ auxiliary variables are known for all the units of the population. The $p$ vectors $(x_{1j} \ldots x_{kj} \ldots x_{Nj})'$, for $j = 1, \ldots, p$, are assumed without loss of generality to be linearly independent.

A sample is denoted by a vector $s = (s_1, \ldots, s_k, \ldots, s_N)'$, where $s_k$ takes the value 1 if $k$ is in the sample and is 0 otherwise. A sampling design $p(\cdot)$ is a probability distribution on the set $\mathcal{S} = \{0, 1\}^N$ of all the possible samples. The random sample $S$ takes the value $s$ with probability $\mathrm{pr}(S = s) = p(s)$. The inclusion probability of unit $k$ is the probability $\pi_k = \mathrm{pr}(S_k = 1)$ that unit $k$ is in the sample and the joint inclusion probability is the probability $\pi_{k\ell} = \mathrm{pr}(S_k = 1 \text{ and } S_\ell = 1)$ that two distinct units are jointly in the sample. The Horvitz–Thompson estimator given by $\hat{Y} = \sum_{k \in U} S_k y_k / \pi_k$ is an unbiased estimator of $Y$. The Horvitz–Thompson estimator of the $j$th auxiliary total $X_j = \sum_{k \in U} x_{kj}$ is $\hat{X}_j = \sum_{k \in U} S_k x_{kj} / \pi_k$. The Horvitz–Thompson estimator vector, $\hat{X} = \sum_{k \in U} S_k x_k / \pi_k$, estimates without bias the totals of the auxiliary variables, $X = \sum_{k \in U} x_k$.

The aim is to construct a balanced sampling design, deﬁned as follows.

D 1. A sampling design p(s) is said to be balanced on the auxiliary variables,

,...,x

, if and only if it satisﬁes the balancing equations given by

=X, (1)

which can also be written as

∑

kµU

=∑

kµU

for all sµS such that p(s)>0.

Remark. If the $y_k$ are linear combinations of the $x_k$, that is $y_k = x_k' b$ for all $k$, where $b$ is a vector of constants, then $\hat{Y} = Y$. More generally, if the $y_k$ are well predicted by a linear combination of the $x_k$, one can expect $\mathrm{var}(\hat{Y})$ to be small.

Next consider the following three particular cases of balanced sampling.

Example 1. A sampling design of fixed sample size $n$ is balanced on the variable $x_k = \pi_k$ ($k \in U$) because

$$\sum_{k \in U} \frac{S_k \pi_k}{\pi_k} = \sum_{k \in U} S_k = n.$$

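Example 2. Suppose that the design is stratified and that, from each stratum $U_h$ ($h = 1, \ldots, H$) of size $N_h$, a simple random sample of size $n_h$ is selected. Then the design is balanced on the variables $\delta_{kh}$, where

$$\delta_{kh} = \begin{cases} 1, & \text{if } k \in U_h, \\ 0, & \text{if } k \notin U_h. \end{cases}$$

In this case, we have

$$\sum_{k \in U} \frac{S_k \delta_{kh}}{\pi_k} = \sum_{k \in U} \delta_{kh} = N_h \quad (h = 1, \ldots, H).$$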
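The balance property of a stratified design can be checked by brute force on a small population. The following is only an illustrative sketch: the two strata, their labels and the stratum sample sizes are hypothetical values chosen for the check, not taken from the paper. It enumerates every stratified sample and verifies that the Horvitz–Thompson estimate of each stratum size equals $N_h$.

```python
from fractions import Fraction
from itertools import combinations, product

# Hypothetical toy population (assumption): two strata of sizes 3 and 4.
strata = {"A": [1, 2, 3], "B": [4, 5, 6, 7]}
n_h = {"A": 2, "B": 2}  # assumed sample size drawn in each stratum

# Under simple random sampling within stratum h, pi_k = n_h / N_h.
pi = {k: Fraction(n_h[h], len(units)) for h, units in strata.items() for k in units}

def ht_stratum_sizes(sample):
    """Horvitz-Thompson estimate of each stratum size N_h for one sample."""
    return {h: sum(Fraction(1) / pi[k] for k in units if k in sample)
            for h, units in strata.items()}

# Enumerate every sample the stratified design can produce.
per_stratum = [combinations(strata[h], n_h[h]) for h in strata]
all_samples = [set().union(*parts) for parts in product(*per_stratum)]

true_sizes = {h: len(units) for h, units in strata.items()}
balanced = all(ht_stratum_sizes(s) == true_sizes for s in all_samples)
print(len(all_samples), balanced)
```

Exact rational arithmetic is used so that the equality test is not affected by floating-point rounding.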

Example 3. In sampling with unequal probabilities, when all the inclusion probabilities are different, the Horvitz–Thompson estimator $\hat{N} = \sum_{k \in U} S_k / \pi_k$ of the population size $N$ is generally random. When the population size is known before selecting the sample, it could be important to select a sample such that

$$\sum_{k \in U} \frac{S_k}{\pi_k} = N. \tag{2}$$

Equation (2) is a balancing equation, in which the balancing variable is $x_k = 1$ ($k \in U$).

Until now, there has been no method by which (2) can be approximately satisﬁed for

arbitrary inclusion probabilities, but we will see that this balancing equation can be

satisﬁed by means of the cube method.

896 J-C D  Y T



Stratiﬁcation and unequal probability sampling are thus special cases of balanced

sampling. In § 6, we present new cases, but the main practical interest of balanced sampling

lies in its generality. Nevertheless, in most cases, the balancing equations (1) cannot be

exactly satisﬁed, as the following example shows.

Example 4. Suppose that $N = 10$, $n = 7$, $\pi_k = 7/10$ ($k \in U$) and that the only auxiliary variable is $x_k = k$ ($k \in U$). Then a balanced sample satisfies

$$\sum_{k \in U} \frac{S_k k}{\pi_k} = \sum_{k \in U} k,$$

so that $\sum_{k \in U} S_k k$ has to be equal to $55 \times 7/10 = 38{\cdot}5$, which is impossible because 38·5 is not an integer. The problem arises because $1/\pi_k$ is not an integer and the population size is small.
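The rounding problem of Example 4 can be confirmed by enumeration. The sketch below is an illustration rather than part of the cube method: it lists all samples of size 7 from the 10 units, computes the Horvitz–Thompson estimate of the total of $x_k = k$ for each, and checks whether any of them equals the true total 55. The variable names are ours.

```python
from fractions import Fraction
from itertools import combinations

N, n = 10, 7
pi = Fraction(n, N)        # equal inclusion probabilities 7/10
x = range(1, N + 1)        # auxiliary variable x_k = k
X = sum(x)                 # true total, 55

# Horvitz-Thompson estimate of X for every possible sample of size n.
estimates = {sum(s) / pi for s in combinations(x, n)}

exactly_balanced = X in estimates
closest = min(estimates, key=lambda e: abs(e - X))
gap = abs(closest - X)
print(exactly_balanced, gap)
```

No sample reaches 55 exactly; the best attainable samples have $\sum_{k} S_k k$ equal to 38 or 39, so the smallest possible error of the estimate is $0{\cdot}5 / 0{\cdot}7 = 5/7$.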

Consequently, our objective is to construct a sampling design which satisﬁes the

balancing equations (1) exactly if possible, and to ﬁnd the best approximation if this

cannot be achieved. The rounding problem becomes negligible when the expected sample

size is large.

3. C    

The cube method is based on a geometric representation of the sampling design. The $2^N$ possible samples correspond to $2^N$ vectors of $\mathbb{R}^N$ in the following way. Each vector $s$ is a vertex of an $N$-cube, and the number of possible samples is the number of vertices of an $N$-cube. A sampling design with inclusion probabilities $\pi_k$ ($k \in U$) consists of assigning a probability $p(s)$ to each vertex of the $N$-cube such that

$$E(S) = \sum_{s \in \mathcal{S}} p(s)\, s = \pi,$$

where $\pi = (\pi_k)$ is the vector of inclusion probabilities. Geometrically, a sampling design consists of expressing the vector $\pi$ as a convex combination of the vertices of the $N$-cube. A sampling algorithm can thus be viewed as a ‘random’ way of reaching a vertex of the $N$-cube from a vector $\pi$ in such a way that the balancing equations (1) are satisfied.

Figure 1 shows the geometric representation of the possible samples from a population

of size N=3.
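The convex-combination view can be made concrete for $N = 3$. In the sketch below, the design is an assumption chosen for illustration (simple random sampling of $n = 2$ units, so each of the three samples of size 2 has probability 1/3); the code recovers the vector of inclusion probabilities as the weighted average of the cube vertices.

```python
from fractions import Fraction
from itertools import product

N = 3
vertices = list(product((0, 1), repeat=N))  # the 2^N vertices of the N-cube

# Assumed design: probability 1/3 on each vertex with exactly two 1s, 0 elsewhere.
design = {s: (Fraction(1, 3) if sum(s) == 2 else Fraction(0)) for s in vertices}

# E(S) = sum over s of p(s) * s, computed coordinate by coordinate.
pi = tuple(sum(design[s] * s[k] for s in vertices) for k in range(N))
print(pi)
```

The weights are nonnegative and sum to one, so the resulting vector $(2/3, 2/3, 2/3)$ is a convex combination of vertices of the cube.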

The cube method is composed of two phases called the flight phase and the landing phase. In the flight phase, the constraints are exactly satisfied. The objective is to round off randomly to 0 or 1 almost all the inclusion probabilities. The landing phase consists of coping as well as possible with the fact that the balancing equations (1) cannot always be satisfied exactly.

Fig. 1. Geometric representation of possible samples in a population of size N=3.

The balancing equations (1) can also be written

$$\sum_{k \in U} a_k s_k = \sum_{k \in U} a_k \pi_k, \qquad s_k \in \{0, 1\},\ k \in U, \tag{3}$$

where $a_k = x_k / \pi_k$ ($k \in U$) and $s_k$ equals 1 if unit $k$ is in the sample and 0 otherwise. The first equation of (3) with given $a_k$ and coordinates $s_k$ defines a hyperplane $Q$ in $\mathbb{R}^N$ of dimension $N - p$. Note that $Q = \pi + \ker A$, where $\ker A$ is the kernel or null-space of the $p \times N$ matrix $A$ given by $A = (a_1 \ldots a_k \ldots a_N)$. The main idea in obtaining a balanced sample is to choose a vertex of the $N$-cube that remains in the hyperplane $Q$ or near to $Q$ if that is not possible.
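The relation $Q = \pi + \ker A$ can be illustrated numerically. The following sketch uses hypothetical values throughout: four units, inclusion probabilities $(1/2, 1/2, 1/4, 1/4)$, the two balancing variables $x_k = (\pi_k, 1)'$, and a hand-picked vector $u$ in $\ker A$. It checks that moving from $\pi$ along $u$ leaves the left-hand side of (3) unchanged.

```python
from fractions import Fraction as F

# Hypothetical inclusion probabilities and balancing variables x_k = (pi_k, 1).
pi = [F(1, 2), F(1, 2), F(1, 4), F(1, 4)]
x = [(p, F(1)) for p in pi]
a = [tuple(xj / p for xj in xk) for xk, p in zip(x, pi)]  # a_k = x_k / pi_k

def A_times(v):
    """Product A v, where the columns of A are the vectors a_k."""
    return tuple(sum(a[k][j] * v[k] for k in range(len(v))) for j in range(2))

target = A_times(pi)                  # right-hand side of (3): the totals of x

u = [F(1), F(-1), F(0), F(0)]         # hand-picked vector with A u = 0
lam = F(1, 2)                         # step size keeping the point inside the cube
moved = [p + lam * uk for p, uk in zip(pi, u)]

print(A_times(u), A_times(moved), target)
```

Here the moved point is $(1, 0, 1/4, 1/4)$: it still lies in $Q$ and in the cube, but two of its coordinates are now 0 or 1.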

If $C = [0, 1]^N$ denotes the $N$-cube in $\mathbb{R}^N$ whose vertices are the samples of $U$, the intersection between $C$ and $Q$ is nonempty, because $\pi$ is in the interior of $C$ and belongs to $Q$. The intersection between an $N$-cube and a hyperplane defines a polytope $K = C \cap Q$, which is of dimension $(N - p)$ because it is the intersection of an $N$-cube and a plane, of dimension $(N - p)$, that has a point in the interior of $C$.

D 2. L et D be a convex polyhedron. A vertex, or extremal point, of D is deﬁned

as a point that cannot be expressed as a convex linear combination of other points of D. T he

set of all the vertices of D is denoted by Ext (D).

D 3. A sample s is said to be exactly balanced if sµExt (C)mQ.

Note that a necessary condition for ﬁnding an exactly balanced sample is that

Ext (C)mQNB.

D 4. A balancing equation system is

(i) exactly satisﬁed if Ext (C)mQ=Ext (CmQ),

(ii) approximately satisﬁed if Ext (C)mQ=B,

(iii) sometimes satisﬁed if Ext (C)mQNExt (CmQ) and Ext (C)mQNB.

Whether the balancing equation system is exactly satisﬁed, approximately satisﬁed or

sometimes satisﬁed depends on the values of p and A.

P 1. If r =(r

) is a vertex of K then #{k|0<r

<1}∏p, where p is the number

of auxiliary variables, and #(B) denotes the cardinality of a set B.

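Proof. Let $A^*$ be the submatrix of $A$ consisting of the columns corresponding to noninteger components of the vector $r$, and let $U^* = \{k \mid 0 < r_k < 1\}$ be the set of the corresponding units. If $q = \#(U^*) > p$, then $\ker A^*$ has dimension $q - p > 0$, and $r$ is not an extreme point of $K$. □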

The following three examples show that the rounding problem can be viewed geometrically. Indeed, the balancing equations cannot be exactly satisfied when the vertices of $K$ are not vertices of $C$, that is when $q > 0$.

Example 5. In Fig. 2(a), a sampling design for a population of size $N = 3$ is considered. The only constraint consists of fixing the sample size $n = 2$, and thus $p = 1$ and $x_k = \pi_k$ ($k \in U$). The inclusion probabilities satisfy $\pi_1 + \pi_2 + \pi_3 = 2$, so that the balancing equation is exactly satisfied.


References

Model assisted survey sampling

On the Two Different Aspects of the Representative Method: the Method of Stratified Sampling and the Method of Purposive Selection

On the Theory of Sampling from Finite Populations

Sampling methods for censuses and surveys

