
# A simple estimator for correlation in incomplete data

01 Dec 1996-Statistics & Probability Letters (North-Holland)-Vol. 31, Iss: 1, pp 51-57

Abstract: It seems fair that the usual sample correlation coefficient should allow improvement when additional samples from the marginals are available. However, some intuitive attempts fail badly. Using control variates, a simple method is presented which asymptotically achieves the optimal improvement.




Statistics & Probability Letters 31 (1996) 51-57

W. Albers*, M.F. Teulings

Department of Applied Mathematics, University of Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands

Received October 1995; revised February 1996

Keywords: Missing data; Control variates; Asymptotic relative efficiency
Mathematics subject classification: 62F12; 62H20
1. Introduction
We consider the situation of a bivariate normal distribution: $(X_1,X_2) \sim N(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$. Suppose we have a sample of size $n$ on $(X_1,X_2)$, and in addition samples of size $m_i$ on $X_i$, $i = 1,2$. In other words, the total sample size may be $n + m_1 + m_2$, but we are handicapped by missing data with respect to both $X_1$ and $X_2$. It is in this situation that we are interested in estimating $\rho = \rho(X_1,X_2)$.
Intuitively, it is clear that one should be able to do better here than with just the complete data subset of size $n$: the additional separate observations on $X_i$ should help. Of course, this intuition is correct. Nevertheless, as we shall point out by, e.g., referring to Johnson and Kotz (1972), one is easily led astray in trying to incorporate this additional information and may actually wind up doing worse rather than better.
To make quite clear what goes wrong in such efforts and how this should be remedied, we shall begin by
considering in Section 2 the complete data case with known means and variances. Note that this case fits in the framework above by letting $m_i \to \infty$, $i = 1,2$. The key notion is that the sample variances act as control variates for the sample covariance. Hence they should not be replaced simply by the actual variances even if these are known. Such a replacement would actually lead to a worse estimator, rather than to a better one.
Optimal application of the control variate approach does lead to the desired improvement, however.
In Section 3 we return to the incomplete data case and demonstrate how the results from the previous
section can be extended to this situation. The simple estimator proposed here is linked to the existing literature
* Corresponding author.
PII S0167-7152(96)00013-2

in Section 4. Moreover, some simulations are discussed, as well as a problem we recently encountered and
which serves as an illustration for the use of methods like the one outlined here.
2. The complete data case
Here we just have a complete sample of size $n$ on $(X_1,X_2)$ from the previous section. Then (cf. e.g. Johnson and Kotz, 1972, p. 104) the MLE of $\rho$ is given by

$$\hat R = \hat S_{12}/(\hat S_1 \hat S_2), \tag{1}$$

where $\hat S_i^2 = (n-1)^{-1} \sum_{j=1}^{n} (X_{ij} - \bar X_i)^2$, $\bar X_i = n^{-1} \sum_{j=1}^{n} X_{ij}$, $i = 1,2$, and $\hat S_{12} = (n-1)^{-1} \sum_{j=1}^{n} (X_{1j} - \bar X_1)(X_{2j} - \bar X_2)$.
As is well known (cf. e.g. Serfling, 1980, p. 125), $n^{1/2}(\hat R - \rho)$ is asymptotically normal with mean zero and variance $\tau^2 = \lim_{n\to\infty}(n \operatorname{var} \hat R)$. Comparison of $\hat R$ to possible competitors in case of known means and variances will be based on this variance $\tau^2$ of the limiting distribution. This enables us to simplify matters in the following sense. It is straightforward to verify that $\tau^2$ remains the same to first order if we replace the sample means $\bar X_i$ by $\mu_i$, $i = 1,2$, in (1). Hence the effect of the means being either known or unknown is negligible. Consequently, in the sequel we shall simply assume the $\mu_i$ to be known. Without loss of generality, these known values can both be assumed to be zero. Hence instead of $\hat R$ from (1) we can take as our starting point the more simple

$$R = S_{12}/(S_1 S_2), \tag{2}$$

where $S_i^2 = n^{-1}\sum_{j=1}^{n} X_{ij}^2$ and $S_{12} = n^{-1}\sum_{j=1}^{n} X_{1j}X_{2j}$.
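As a concrete illustration of (2), a minimal sketch in plain Python (not from the paper; the sampling helper and variable names are ours) that draws from $N(0,0,1,1,\rho)$ and evaluates $R$ with the means known to be zero:

```python
import math
import random

def sample_bivariate_normal(n, rho, rng):
    """Draw n pairs from N(0, 0, 1, 1, rho) via the linear construction
    X2 = rho*Z1 + sqrt(1 - rho^2)*Z2 with independent standard normals."""
    xs, ys = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        xs.append(z1)
        ys.append(rho * z1 + math.sqrt(1 - rho**2) * z2)
    return xs, ys

def corr_known_means(x1, x2):
    """R = S12/(S1*S2) from (2): means known to be zero, divisor n."""
    n = len(x1)
    s12 = sum(a * b for a, b in zip(x1, x2)) / n
    s1 = math.sqrt(sum(a * a for a in x1) / n)
    s2 = math.sqrt(sum(b * b for b in x2) / n)
    return s12 / (s1 * s2)

rng = random.Random(1)
x1, x2 = sample_bivariate_normal(100_000, 0.7, rng)
print(corr_known_means(x1, x2))  # close to rho = 0.7 for large n
```

With $n = 100{,}000$ the estimate should lie within a few thousandths of $\rho$, in line with the asymptotic standard deviation $(1-\rho^2)/\sqrt{n}$.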
For the variances, matters are less straightforward: it now makes a difference whether these are known or not. In fact, this precisely motivates the consideration of possible alternatives for $R$. Nevertheless, in analyzing the behavior of these estimators, we can always, without loss of generality, assume that $\sigma_i^2 = 1$, $i = 1,2$. Hence, we shall simply let $(X_1,X_2) \sim N(0,0,1,1,\rho)$ and, depending on the situation, assume the fact that $\sigma_i^2 = 1$ to be known or not.
First consider the standard case where the $\sigma_i^2$ are unknown and $R$ is the MLE for $\rho$ (cf. again Johnson and Kotz, 1972, p. 104). Letting $V = S_{12} - \rho$ and $U_i = S_i^2 - 1$, $i = 1,2$, we obtain that

$$R = (\rho + V)\prod_{i=1}^{2}(1 + U_i)^{-1/2} \doteq \rho + V - \tfrac12 \rho \sum_{i=1}^{2} U_i, \tag{3}$$

where "$\doteq$" stands for equality to first order. It is easy to verify that
$$nEV^2 = 1 + \rho^2, \qquad nEVU_i = 2\rho, \qquad nE\Bigl(\sum_{i=1}^{2} U_i\Bigr)^2 = 4(1 + \rho^2). \tag{4}$$

This in its turn implies that $\tau^2 = n \operatorname{var}(R) \doteq n\,\mathrm{MSE}(R) \doteq n\bigl\{EV^2 - \rho\sum_{i=1}^{2}EVU_i + \tfrac14\rho^2 E\bigl(\sum_{i=1}^{2}U_i\bigr)^2\bigr\} = (1-\rho^2)^2$, which agrees with Johnson and Kotz (1972, p. 104).
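The algebra combining (3) and (4) can be checked mechanically; a short sketch (ours, not from the paper) that plugs the moments from (4) into $n\{EV^2 - \rho\sum EVU_i + \tfrac14\rho^2 E(\sum U_i)^2\}$ and compares with the closed form:

```python
def tau_squared(rho):
    """n*MSE(R) assembled from the moments in (4):
    nEV^2 = 1 + rho^2, nEVU_i = 2*rho (two terms), nE(U1+U2)^2 = 4(1+rho^2)."""
    return (1 + rho**2) - rho * (2 * rho + 2 * rho) + 0.25 * rho**2 * 4 * (1 + rho**2)

# Should agree with (1 - rho^2)^2, as in Johnson and Kotz (1972, p. 104).
for rho in (-0.9, -0.3, 0.0, 0.5, 0.7):
    assert abs(tau_squared(rho) - (1 - rho**2)**2) < 1e-9
```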
Suppose now the fact that $\sigma_i^2 = 1$ is known. At first sight it may seem indicated in this situation to replace $R$ from (2) by

$$R(0) = S_{12}/(\sigma_1\sigma_2) = S_{12}. \tag{5}$$

But from (3) and (4), it is immediate that the variance of the limiting distribution of $n^{1/2}(R(0) - \rho)$ equals $1 + \rho^2$. This is considerably larger, instead of smaller, than $(1-\rho^2)^2 = \tau^2$.

The explanation for this phenomenon lies in the fact that the $S_i$ act as control variates, and as such are more useful than the known values $\sigma_i$. Briefly, let $T_i$ be unbiased estimators for $\theta_i$, $i = 1,2$, with $\tilde\rho = \rho(T_1,T_2) \neq 0$. If $\theta_2$ is known, $T_1$ can be improved upon by $T_3 = T_1 + \gamma(T_2 - \theta_2)$ as an estimator for $\theta_1$ by choosing the coefficient $\gamma$ of the control variate in the right way. In fact, it is straightforward that for $\gamma = -\tilde\rho\,\sigma(T_1)/\sigma(T_2)$ we achieve the minimal value $\sigma^2(T_3) = \sigma^2(T_1)(1 - \tilde\rho^2) < \sigma^2(T_1)$.
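The variance reduction $\sigma^2(T_3) = \sigma^2(T_1)(1-\tilde\rho^2)$ can be seen in a toy example (entirely our construction; the pair $(T_1,T_2)$ below is hypothetical, with $\operatorname{var}(T_1)=1.25$ and $\tilde\rho^2 = 0.8$):

```python
import random

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

rng = random.Random(2)
theta2 = 0.0  # the known expectation of T2
t1, t2 = [], []
for _ in range(50_000):
    z = rng.gauss(0, 1)
    t1.append(1.0 + z + 0.5 * rng.gauss(0, 1))  # unbiased for theta1 = 1, var 1.25
    t2.append(theta2 + z)                        # unbiased for theta2, correlated with T1

# Optimal coefficient gamma = -cov(T1,T2)/var(T2), i.e. -rho*sigma(T1)/sigma(T2)
m1 = sum(t1) / len(t1)
cov = sum((a - m1) * (b - theta2) for a, b in zip(t1, t2)) / len(t1)
gamma = -cov / variance(t2)
t3 = [a + gamma * (b - theta2) for a, b in zip(t1, t2)]

# var(T3) should be near var(T1)*(1 - rho^2) = 1.25 * 0.2 = 0.25
print(variance(t1), variance(t3))
```

Note that the known $\theta_2$ enters only through the correction term, exactly as the known $\sigma_i$ will below: it reweights the control variate rather than replacing the estimator.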
Application of this technique here leads to the proposal:

$$R(\gamma) = \frac{S_{12}}{\sigma_1\sigma_2 \prod_{i=1}^{2}(S_i^2/\sigma_i^2)^{\gamma/2}}. \tag{6}$$
Note that $R(1) = R$ from (2), the sole possible choice for unknown $\sigma_i$, whereas $R(0)$ in (6) indeed agrees with $R(0)$ from (5). Using the fact that the known values of $\sigma_i$ are both equal to one, $R(\gamma)$ reduces to $S_{12}/\{\prod_{i=1}^{2} S_i^2\}^{\gamma/2}$, which in analogy to (3) satisfies $R(\gamma) \doteq \rho + V - \tfrac12\gamma\rho\sum_{i=1}^{2} U_i$. Together with (4), this immediately leads to an expression for the variance of the limiting distribution of $n^{1/2}(R(\gamma) - \rho)$, which is easily seen to be minimized for

$$\gamma^* = \frac{2}{1+\rho^2}, \qquad \text{with} \quad n \operatorname{var}(R(\gamma^*)) \doteq \frac{(1-\rho^2)^2}{1+\rho^2}. \tag{7}$$

As $1 \leqslant \gamma^* \leqslant 2$, it is seen that the known $\sigma_i$ are not used to replace the $S_i$, but rather to add more weight to these control variates. Indeed the choice $R(0)$ literally goes in the wrong direction. To apply the method, we need to replace $\gamma^*$ from (7) by its estimated counterpart $\hat\gamma^* = 2/(1 + R^2(0))$. It is again straightforward to verify that this change is negligible as far as the asymptotic variance is concerned.
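A small Monte Carlo sketch (ours; same $N(0,0,1,1,\rho)$ setup as above, helper names hypothetical) comparing the three estimators $R = R(1)$, $R(0)$, and $R(\hat\gamma^*)$:

```python
import math
import random

def draw(n, rho, rng):
    pairs = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        pairs.append((z1, rho * z1 + math.sqrt(1 - rho**2) * z2))
    return pairs

def estimators(pairs):
    """Return (R, R(0), R(gamma-hat*)) for known means 0 and variances 1."""
    n = len(pairs)
    s12 = sum(a * b for a, b in pairs) / n
    s1sq = sum(a * a for a, _ in pairs) / n
    s2sq = sum(b * b for _, b in pairs) / n
    r = s12 / math.sqrt(s1sq * s2sq)            # R = R(1), eq. (2)
    r0 = s12                                     # R(0), eq. (5)
    gam = 2 / (1 + r0 ** 2)                      # gamma-hat* = 2/(1 + R^2(0))
    r_star = s12 / (s1sq * s2sq) ** (gam / 2)    # R(gamma), eq. (6) with sigma_i = 1
    return r, r0, r_star

rho, n, reps = 0.7, 100, 4000
rng = random.Random(3)
sums, sq = [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]
for _ in range(reps):
    for k, est in enumerate(estimators(draw(n, rho, rng))):
        sums[k] += est
        sq[k] += est ** 2
nvar = [n * (sq[k] / reps - (sums[k] / reps) ** 2) for k in range(3)]
# The asymptotics predict roughly (1-rho^2)^2 = 0.26 for R, 1 + rho^2 = 1.49
# for R(0), and (1-rho^2)^2/(1+rho^2) ~ 0.17 for R(gamma-hat*).
print(nvar)
```

The simulated values of $n \operatorname{var}$ should reproduce the ordering in the text: $R(0)$ is much worse than $R$, while $R(\hat\gamma^*)$ improves on both.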
From (7) it follows that

$$\mathrm{ARE}(R(\hat\gamma^*), R) = 1 + \rho^2, \tag{8}$$

which shows that the improvement can be quite worthwhile. To put the result in (8) in its proper perspective, it is desirable to compare the performance of $R(\gamma^*)$ to that of an efficient estimator such as the MLE. In fact, Johnson and Kotz (1972, p. 105) discuss the MLE for this case of known parameters for the marginal distributions. However, it turns out to be rather complicated, requiring the solution of a cubic equation. Therefore, Johnson and Kotz also give a first-order version which they advocate for many practical purposes. Now observe that the estimator corresponding to $\hat\gamma^*$ mentioned above satisfies

$$R(\hat\gamma^*) \doteq R(0)\Bigl\{1 - (\hat\gamma^*/2)\sum_{i=1}^{2}(S_i^2 - 1)\Bigr\} = R(0) + \frac{R(0)}{1 + R^2(0)}\,(2 - S_1^2 - S_2^2). \tag{9}$$

But this latter expression precisely agrees with the aforementioned first-order version of the MLE from Johnson and Kotz. Hence, $R(\hat\gamma^*)$ not only improves on $R = R(1)$ as shown in (8), but moreover it is asymptotically efficient.

A final remark concerns the following. Application of MLEs automatically ensures that the estimators of $\rho$ do not fall outside the parameter space $[-1, 1]$. If desired, a similar restriction can be imposed on first-order versions like $R(\hat\gamma^*)$ or the one in (9).
3. Missing observations
After considering the limiting case in the previous section, we now address the general case, where the variances are not assumed to be known, but we have samples of size $m_i$ on $X_i$, $i = 1,2$, in addition to the complete sample of size $n$ on $(X_1,X_2)$. Just as in Section 2, the feeling arises that $R$ from (2) should

be improved upon, this time by utilizing the information in the $m_i$ additional observations. However, in this general case, even more care is needed in such attempts. For example, Johnson and Kotz (1972, p. 108) mention a possibility, but more or less discard it again by pointing out that its possible increase of accuracy must be judged against the possibility of obtaining values outside $[-1, 1]$.
We shall use the following setup and notation. Suppose we have $X_{ij}$ for $i = 1,2$, $j = 1,\ldots,n$, and in addition for $i = 1$, $j = n+1,\ldots,n+m_1$ and for $i = 2$, $j = n+m_1+1,\ldots,n+m_1+m_2$. Let

$$\sum_j = \sum_{j=1}^{n}, \qquad \sum_j^{(1)} = \sum_{j=n+1}^{n+m_1}, \qquad \sum_j^{(2)} = \sum_{j=n+m_1+1}^{n+m_1+m_2}, \qquad \lambda_i = \frac{n}{n+m_i}, \quad i = 1,2. \tag{10}$$
For our purpose we are typically interested in constant $\lambda_i$ (apart from rounding effects). Hence, we shall suppose that $m_i$, $i = 1,2$, and $n$ tend to infinity at a common rate. From Section 2 it is clear that we should not use the additional observations to simply replace $S_i$, $i = 1,2$, by new sample variances based on $(n+m_i)$ observations. For larger values of $m_i/n$, the resulting estimator would behave more and more like $R(0)$, which was shown to be an extremely bad choice.
Fortunately, Section 2 also suggests the right direction to take: we should mimic the behavior of $R(\gamma)$ from (6). In addition to $S_i^2 = n^{-1}\sum_{j=1}^{n} X_{ij}^2$, let $(S_i')^2 = m_i^{-1}\sum_j^{(i)} X_{ij}^2$ and $U_i = S_i^2 - 1$, $U_i' = (S_i')^2 - 1$, $i = 1,2$. We replace (6) by

$$R(\gamma_1,\gamma_2) = S_{12}\big/\bigl\{(S_1')^{1-\gamma_1} S_1^{\gamma_1}\,(S_2')^{1-\gamma_2} S_2^{\gamma_2}\bigr\}. \tag{11}$$

By analogy to (3) we obtain that $R(\gamma_1,\gamma_2) \doteq \rho + V - \tfrac12\rho\sum_{i=1}^{2}\{(1-\gamma_i)U_i' + \gamma_i U_i\}$.
To evaluate the variance of the limiting distribution of $n^{1/2}(R(\gamma_1,\gamma_2) - \rho)$, some results are needed in addition to (4). To begin with, we observe that $EVU_i' = EU_iU_i' = EU_1'U_2' = 0$. Moreover, $nEU_i^2 = 2$, $nEU_1U_2 = 2\rho^2$ and $nE(U_i')^2 = 2n/m_i = 2\lambda_i/(1-\lambda_i)$. Together with (11), this shows that

$$n \operatorname{var} R(\gamma_1,\gamma_2) \doteq (1+\rho^2) - 2\rho^2\sum_{i=1}^{2}\gamma_i + \tfrac12\rho^2\sum_{i=1}^{2}\gamma_i^2 + \rho^4\gamma_1\gamma_2 + \tfrac12\rho^2\sum_{i=1}^{2}(1-\gamma_i)^2\frac{\lambda_i}{1-\lambda_i}. \tag{12}$$
The expression in (12) is easily seen to be minimized for

$$\gamma_i^* = \frac{2 - \lambda_i - \rho^2(1-\lambda_i)(2-\lambda_{3-i})}{1 - \rho^4(1-\lambda_1)(1-\lambda_2)}, \tag{13}$$

$i = 1,2$, whereas the corresponding minimal value satisfies

$$n \operatorname{var} R(\gamma_1^*,\gamma_2^*) \doteq \frac{(1-\rho^2)^2\bigl\{1 - \tfrac12\rho^2(2-\lambda_1-\lambda_2)\bigr\}}{1 - \rho^4(1-\lambda_1)(1-\lambda_2)}. \tag{14}$$

Note that for the special case of equal numbers of additional observations, (13) and (14) simplify to

$$\gamma^* = \frac{2-\lambda}{1+\rho^2(1-\lambda)}, \qquad n \operatorname{var} R(\gamma^*,\gamma^*) \doteq \frac{(1-\rho^2)^2}{1+\rho^2(1-\lambda)}, \tag{15}$$

with $\lambda_1 = \lambda_2 = \lambda$ and $\gamma_1^* = \gamma_2^* = \gamma^*$. Also observe that the variance expressions in (14) and (15) produce the values $\tau^2 = (1-\rho^2)^2$ and $(1-\rho^2)^2/(1+\rho^2)$ for $\lambda_i = 1$ and $\lambda_i = 0$, respectively. Hence, the result in (14) indeed fills the gap between the two extremes considered in Section 2. If only a small fraction of additional observations is available, the $\lambda_i$ are close to 1, the $\gamma_i^*$ from (13) are near one, and the variance of the limiting distribution is close to $\tau^2$. Likewise, for relatively large numbers of additional observations, the $\lambda_i$ are close to 0 and we approach the result for the asymptotic variance from (7).

To make the estimator applicable in practice, we replace $\gamma_i^*$ from (13) by $\hat\gamma_i^*$, obtained by substituting $R$ for $\rho$. The difference caused is again negligible and it follows that (8) generalizes to

$$\mathrm{ARE}(R(\hat\gamma_1^*,\hat\gamma_2^*), R) = \frac{1 - \rho^4(1-\lambda_1)(1-\lambda_2)}{1 - \tfrac12\rho^2(2-\lambda_1-\lambda_2)}, \tag{16}$$

which reduces to $\mathrm{ARE}(R(\hat\gamma^*,\hat\gamma^*), R) = 1 + \rho^2(1-\lambda)$ for $\lambda = \lambda_1 = \lambda_2$. This surprisingly simple relation
adequately makes clear the gain to be achieved by the optimal use of additional information.
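The weights (13) and the efficiency (16) are explicit formulas, so the stated special cases can be verified directly; a sketch (ours; function names hypothetical) that also checks the equal-$\lambda$ reductions (15) and $1 + \rho^2(1-\lambda)$:

```python
def gamma_star(i, rho, lam1, lam2):
    """Optimal weight from (13); i is 1 or 2."""
    lam_i = lam1 if i == 1 else lam2
    lam_other = lam2 if i == 1 else lam1
    num = 2 - lam_i - rho**2 * (1 - lam_i) * (2 - lam_other)
    den = 1 - rho**4 * (1 - lam1) * (1 - lam2)
    return num / den

def are(rho, lam1, lam2):
    """ARE(R(gamma-hat*), R) from (16)."""
    return (1 - rho**4 * (1 - lam1) * (1 - lam2)) / (1 - 0.5 * rho**2 * (2 - lam1 - lam2))

for rho in (0.3, 0.7):
    for lam in (0.1, 0.5, 0.9):
        # Equal-lambda reductions: (15) and ARE = 1 + rho^2 (1 - lambda)
        assert abs(gamma_star(1, rho, lam, lam) - (2 - lam) / (1 + rho**2 * (1 - lam))) < 1e-9
        assert abs(are(rho, lam, lam) - (1 + rho**2 * (1 - lam))) < 1e-9

# Extremes: lambda = 1 (no extra data) gives gamma* = 1, i.e. plain R;
# lambda -> 0 (unlimited extra data) recovers gamma* = 2/(1 + rho^2) from (7).
assert abs(gamma_star(1, 0.7, 1.0, 1.0) - 1.0) < 1e-9
assert abs(gamma_star(1, 0.7, 0.0, 0.0) - 2 / (1 + 0.7**2)) < 1e-9
```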
4. Concluding remarks
In searching the literature for related approaches, we came across Hocking and Smith (1968) (see also Smith, 1968). Here initial estimators are based on the complete data, after which these estimators are improved stepwise by taking the subsequent groups of incomplete observations into account. At each step, the improvement is achieved by adding a suitable multiple of the difference between the estimator of a parameter based on the complete data and the estimator of the same parameter based on the group of incomplete data considered in this step. As $R(\gamma_1,\gamma_2)$ from (11) satisfies $R(\gamma_1,\gamma_2) \doteq R(1,1)\bigl(1 - \tfrac12\sum_{i=1}^{2}(1-\gamma_i)\{(S_i')^2 - S_i^2\}\bigr)$, a strong resemblance to the method suggested by Hocking and Smith (1968) suggests itself. Indeed, explicit evaluation reveals that both methods are in fact equivalent to first order and, hence, in particular, produce the same efficiency figures. As the actual evaluation is tedious but straightforward, we refer to Teulings (1994) for details. In comparing the two methods, we note that the method we propose here is much easier to implement and to understand.
A second topic we briefly mention is a small simulation study, performed to obtain an impression of how the first-order approximations work out for finite sample sizes. Again we refer to Teulings (1994) for full details. Of course, we now work again with $\hat R$, $\hat S_i^2$ and $\hat S_{12}$ from (1), rather than with $R$, $S_i^2$ and $S_{12}$ from (2). Likewise, we apply $(\hat S_i')^2 = (m_i - 1)^{-1}\sum_j^{(i)}(X_{ij} - \bar X_i')^2$, where $\bar X_i' = m_i^{-1}\sum_j^{(i)} X_{ij}$.
In Table 1 we present results for $\rho = 0.70$, $n = 20$ and $n = 100$ complete observations, and equal numbers $m_1 = m_2 = m$ of additional observations, running from $m = 10$ to $m = 2000$. Using 1000 simulations on each run, we obtain estimates $\hat E\,\hat R$, $\widehat{\operatorname{var}}\,\hat R$, $\hat E\,\hat R(\hat\gamma^*,\hat\gamma^*)$ and $\widehat{\operatorname{var}}\,\hat R(\hat\gamma^*,\hat\gamma^*)$ (where $\hat R(\hat\gamma^*,\hat\gamma^*)$ is obtained from $R(\hat\gamma^*,\hat\gamma^*)$ by replacing $S_i$ by $\hat S_i$ and $S_i'$ by $\hat S_i'$, $i = 1,2$). Another reason for performing the simulations is to compare the present estimator to the one mentioned above, which was introduced by Hocking and Smith (1968). Hence in Table 1 we also list the values $\hat E\,\hat R_{HS}$ and $\widehat{\operatorname{var}}\,\hat R_{HS}$, obtained for this estimator. Finally, the values based on the asymptotics to which we want to compare are $AV_1 = n^{-1}\tau^2 = (1-\rho^2)^2/n$ and $AV_2 = (1-\rho^2)^2\{1 - \tfrac12\rho^2(2-\lambda_1-\lambda_2)\}/\{(1-\rho^4(1-\lambda_1)(1-\lambda_2))n\}$ (cf. (14)), respectively. These are given as well: $AV_1$ for comparison to $\widehat{\operatorname{var}}\,\hat R$, and $AV_2$ for both $\widehat{\operatorname{var}}\,\hat R(\hat\gamma^*,\hat\gamma^*)$ and $\widehat{\operatorname{var}}\,\hat R_{HS}$.

From Table 1 we observe that the value $n = 20$ is still quite small: the values of $\widehat{\operatorname{var}}\,\hat R$ do not get very close to $AV_1$. As a consequence, there also remains a bit of a gap between $\widehat{\operatorname{var}}\,\hat R(\hat\gamma^*,\hat\gamma^*)$ and $\widehat{\operatorname{var}}\,\hat R_{HS}$ on the one hand, and $AV_2$ on the other. For $n = 100$, the picture is much better: the values of $\widehat{\operatorname{var}}\,\hat R$ are quite close to $AV_1 = 0.0026$. Moreover, the behavior of the other two variances is now adequately approximated by $AV_2$. Also note that the figures for $n = 100$ nicely illustrate the result on the ARE in (16), according to which the efficiency values $AV_1/AV_2$ in this example are $1 + 0.49m/(100+m)$. Finally, observe that there is little difference in performance between $\hat R(\hat\gamma^*,\hat\gamma^*)$ and $\hat R_{HS}$. Hence, the simulations suggest no preference for either approach. Thus, we uphold the point of view expressed above, according to which the present method has the advantage of being more simple, both with respect to computing and to understanding it.
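The quoted efficiency values for $n = 100$ follow directly from (16) with $\lambda = n/(n+m)$; a one-line sketch (ours) reproducing $1 + 0.49m/(100+m)$ for the $m$ used in the study:

```python
def ave_ratio(rho, n, m):
    """AV1/AV2 = ARE from (16) with lambda1 = lambda2 = n/(n + m)."""
    lam = n / (n + m)
    return 1 + rho**2 * (1 - lam)

for m in (10, 100, 2000):
    print(m, round(ave_ratio(0.7, 100, m), 3))
# prints: 10 1.045, then 100 1.245, then 2000 1.467
```

So even with only $m = 10$ extra observations per marginal the asymptotic gain is about 4.5%, growing toward the complete-information bound $1 + \rho^2 = 1.49$ as $m$ increases.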
