Statistics & Probability Letters 31 (1996) 51-57

A simple estimator for correlation in incomplete data

W. Albers*, M.F. Teulings

Department of Applied Mathematics, University of Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands

Received October 1995; revised February 1996

Abstract

It seems fair that the usual sample correlation coefficient should allow improvement when additional samples from the marginals are available. However, some intuitive attempts fail badly. Using control variates, a simple method is presented which asymptotically achieves the optimal improvement.

Keywords: Missing data; Control variates; Asymptotic relative efficiency

Mathematics subject classification: 62F12; 62H20

1. Introduction

We consider the situation of a bivariate normal distribution: $(X_1, X_2) \sim N(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$. Suppose we have a sample of size $n$ on $(X_1, X_2)$, and in addition samples of size $m_i$ on $X_i$, $i = 1, 2$. In other words, the total sample size may be $n + m_1 + m_2$, but we are handicapped by missing data with respect to both $X_1$ and $X_2$. It is in this situation that we are interested in estimating $\rho = \rho(X_1, X_2)$.

Intuitively, it is clear that one should be able to do better here than with just the complete data subset of size $n$: the additional separate observations on $X_i$ should help. Of course, this intuition is correct. Nevertheless, as we shall point out by, e.g., referring to Johnson and Kotz (1972), one is easily led astray in trying to incorporate this additional information and may actually wind up doing worse rather than better.

To make quite clear what goes wrong in such efforts and how this should be remedied, we shall begin by considering in Section 2 the complete data case with known means and variances. Note that this case fits in the framework above by letting $m_i \to \infty$, $i = 1, 2$. The key notion is that the sample variances act as control variates for the sample covariance. Hence they should not be replaced simply by the actual variances even if these are known. Such a replacement would actually lead to a worse estimator, rather than to a better one. Optimal application of the control variate approach does lead to the desired improvement, however.

In Section 3 we return to the incomplete data case and demonstrate how the results from the previous section can be extended to that situation. The simple estimator proposed here is linked to the existing literature in Section 4. Moreover, some simulations are discussed, as well as a problem we recently encountered which serves as an illustration for the use of methods like the one outlined here.

* Corresponding author.

0167-7152/96/$12.00 © 1996 Elsevier Science B.V. All rights reserved
PII S0167-7152(96)00013-2

2. The complete data case

Here we just have a complete sample of size $n$ on $(X_1, X_2)$ from the previous section. Then (cf. e.g. Johnson and Kotz, 1972, p. 104) the MLE of $\rho$ is given by

$$\hat R = \hat S_{12}/(\hat S_1 \hat S_2), \tag{1}$$

where $\hat S_i^2 = (n-1)^{-1} \sum_{j=1}^n (X_{ij} - \bar X_i)^2$, $\bar X_i = n^{-1} \sum_{j=1}^n X_{ij}$, $i = 1, 2$, and $\hat S_{12} = (n-1)^{-1} \sum_{j=1}^n (X_{1j} - \bar X_1)(X_{2j} - \bar X_2)$.

As is well known (cf. e.g. Serfling, 1980, p. 125), $n^{1/2}(\hat R - \rho)$ is asymptotically normal with mean zero and variance $\tau^2 = \lim_{n \to \infty} (n \operatorname{var} \hat R)$. Comparison of $\hat R$ to possible competitors in case of known means and variances will be based on this variance $\tau^2$ of the limiting distribution. This enables us to simplify matters in the following sense. It is straightforward to verify that $\tau^2$ remains the same to first order if we replace the sample means $\bar X_i$ by $\mu_i$, $i = 1, 2$, in (1). Hence the effect of the means being either known or unknown is negligible. Consequently, in the sequel we shall simply assume the $\mu_i$ to be known. Without loss of generality, these known values can both be assumed to be zero. Hence instead of $\hat R$ from (1) we can take as our starting point the simpler

$$R = S_{12}/(S_1 S_2), \tag{2}$$

where $S_i^2 = n^{-1} \sum_{j=1}^n X_{ij}^2$ and $S_{12} = n^{-1} \sum_{j=1}^n X_{1j} X_{2j}$.

For the variances, matters are less straightforward: it now makes a difference whether these are known or not. In fact, this precisely motivates the consideration of possible alternatives for $R$. Nevertheless, in analyzing the behavior of these estimators, we can always, without loss of generality, assume that $\sigma_i^2 = 1$, $i = 1, 2$. Hence, we shall simply let $(X_1, X_2) \sim N(0, 0, 1, 1, \rho)$ and, depending on the situation, assume the fact that $\sigma_i^2 = 1$ to be known or not.

First consider the standard case where the $\sigma_i^2$ are unknown and $R$ is the MLE for $\rho$ (cf. again Johnson and Kotz, 1972, p. 104). Letting $V = S_{12} - \rho$ and $U_i = S_i^2 - 1$, $i = 1, 2$, we obtain that

$$R = (\rho + V) \prod_{i=1}^2 (1 + U_i)^{-1/2} \doteq \rho + V - \tfrac{1}{2} \rho \sum_{i=1}^2 U_i, \tag{3}$$

where "$\doteq$" stands for equality to first order. It is easy to verify that

$$n E V^2 = 1 + \rho^2, \qquad n E V U_i = 2\rho, \qquad n E \Bigl( \sum_{i=1}^2 U_i \Bigr)^2 = 4(1 + \rho^2). \tag{4}$$

This in its turn implies that $\tau^2 = n \operatorname{var}(R) \doteq n \,\mathrm{MSE}(R) \doteq n \bigl\{ E V^2 - \rho \sum_{i=1}^2 E V U_i + \tfrac{1}{4} \rho^2 E \bigl( \sum_{i=1}^2 U_i \bigr)^2 \bigr\} = (1 - \rho^2)^2$, which agrees with Johnson and Kotz (1972, p. 104).

Suppose now that the fact that $\sigma_i^2 = 1$ is known. At first sight it may seem indicated in this situation to replace $R$ from (2) by

$$R(0) = S_{12}/(\sigma_1 \sigma_2) = S_{12}. \tag{5}$$

But from (3) and (4), it is immediate that the variance of the limiting distribution of $n^{1/2}(R(0) - \rho)$ equals $1 + \rho^2$. This is considerably larger, instead of smaller, than $(1 - \rho^2)^2 = \tau^2$.
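This contrast is easily confirmed numerically. The following Python sketch (an illustration added for concreteness; the sample size, replication count and seed are arbitrary choices, not from the original analysis) estimates both limiting variances by Monte Carlo for $\rho = 0.7$, where $1 + \rho^2 = 1.49$ and $(1 - \rho^2)^2 \approx 0.26$:

```python
import numpy as np

# Monte Carlo comparison of R(0) = S12 (known variances plugged in) and
# R = S12/(S1*S2), for standardized bivariate normal data with known means 0;
# illustrates that n*var(R(0)) ~ 1 + rho^2 exceeds n*var(R) ~ (1 - rho^2)^2.
rng = np.random.default_rng(4)
rho, n, reps = 0.7, 200, 4000
cov = [[1.0, rho], [rho, 1.0]]

r0 = np.empty(reps)
r = np.empty(reps)
for k in range(reps):
    xy = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    s12 = np.mean(xy[:, 0] * xy[:, 1])     # S12
    s1_sq = np.mean(xy[:, 0] ** 2)         # S1^2
    s2_sq = np.mean(xy[:, 1] ** 2)         # S2^2
    r0[k] = s12                            # R(0) from (5)
    r[k] = s12 / np.sqrt(s1_sq * s2_sq)    # R from (2)

print(n * r0.var(), n * r.var())  # roughly 1.49 versus 0.26
```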


The explanation for this phenomenon lies in the fact that the $S_i$ act as control variates, and as such are more useful than the known values $\sigma_i$. Briefly, let $T_i$ be unbiased estimators for $\theta_i$, $i = 1, 2$, with $\tilde\rho = \rho(T_1, T_2) \neq 0$. If $\theta_2$ is known, $T_1$ can be improved upon by $T_3 = T_1 + \gamma (T_2 - \theta_2)$ as an estimator for $\theta_1$ by choosing the coefficient $\gamma$ of the control variate in the right way. In fact, it is straightforward that for $\gamma = -\tilde\rho \, \sigma(T_1)/\sigma(T_2)$ we achieve the minimal value $\sigma^2(T_3) = \sigma^2(T_1)(1 - \tilde\rho^2) < \sigma^2(T_1)$.
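To fix ideas, this variance reduction can be demonstrated with a small Python experiment (an illustrative sketch; the particular estimators and constants below are arbitrary choices made here, not taken from the text):

```python
import numpy as np

# Control-variate step T3 = T1 + gamma*(T2 - theta2): T1 is unbiased for
# theta1, T2 is unbiased for the *known* theta2, and the two are correlated;
# the optimal gamma = -corr * sd(T1)/sd(T2) shrinks the variance of T1
# by the factor (1 - corr^2).
rng = np.random.default_rng(0)
theta2 = 0.0
reps, n = 20000, 50

T1 = np.empty(reps)
T2 = np.empty(reps)
for k in range(reps):
    x = rng.normal(1.0, 1.0, n)                     # sample for theta1 = 1
    y = 0.8 * (x - 1.0) + rng.normal(0.0, 0.6, n)   # mean theta2 = 0, correlated
    T1[k] = x.mean()
    T2[k] = y.mean()

corr = np.corrcoef(T1, T2)[0, 1]
gamma = -corr * T1.std() / T2.std()
T3 = T1 + gamma * (T2 - theta2)
print(T1.var(), T3.var())   # var(T3) = var(T1) * (1 - corr^2)
```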

Application of this technique here leads to the proposal

$$R(\gamma) = \frac{S_{12}}{\sigma_1 \sigma_2 \prod_{i=1}^2 (S_i^2/\sigma_i^2)^{\gamma/2}}. \tag{6}$$

Note that $R(1) = R$ from (2), the sole possible choice for unknown $\sigma_i$, whereas $R(0)$ in (6) indeed agrees with $R(0)$ from (5). Using the fact that the known values of $\sigma_i$ are both equal to one, $R(\gamma)$ reduces to $S_{12}/\{\prod_{i=1}^2 S_i^2\}^{\gamma/2}$, which in analogy to (3) satisfies $R(\gamma) \doteq \rho + V - \tfrac{1}{2} \gamma \rho \sum_{i=1}^2 U_i$. Together with (4), this immediately leads to an expression for the variance of the limiting distribution of $n^{1/2}(R(\gamma) - \rho)$, which is easily seen to be minimized for

$$\gamma^* = \frac{2}{1 + \rho^2}, \qquad \text{with} \qquad n \operatorname{var}(R(\gamma^*)) \doteq \frac{(1 - \rho^2)^2}{1 + \rho^2}. \tag{7}$$

As $1 \leq \gamma^* \leq 2$, it is seen that the known $\sigma_i$ are not used to replace the $S_i$, but rather to add more weight to these control variates. Indeed the choice $R(0)$ literally goes in the wrong direction. To apply the method, we need to replace $\gamma^*$ from (7) by its estimated counterpart $\hat\gamma^* = 2/(1 + R^2(0))$. It is again straightforward to verify that this change is negligible as far as the asymptotic variance is concerned.
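For concreteness, the resulting procedure can be sketched in a few lines of Python (an illustration added here; the function name and sample size are our own choices, and the means and variances are taken known and equal to 0 and 1 as assumed in this section):

```python
import numpy as np

def r_gamma_star(x1, x2):
    """Control-variate estimator R(gamma*) of rho for a complete sample,
    assuming known means 0 and known variances 1 (the setup of Section 2)."""
    n = len(x1)
    s12 = np.dot(x1, x2) / n          # S12
    s1_sq = np.dot(x1, x1) / n        # S1^2
    s2_sq = np.dot(x2, x2) / n        # S2^2
    r0 = s12                          # R(0) = S12, since sigma_i = 1
    gamma = 2.0 / (1.0 + r0 ** 2)     # plug-in gamma* = 2 / (1 + R(0)^2)
    return s12 / (s1_sq * s2_sq) ** (gamma / 2.0)

rng = np.random.default_rng(1)
rho = 0.7
cov = [[1.0, rho], [rho, 1.0]]
x = rng.multivariate_normal([0.0, 0.0], cov, size=2000)
print(r_gamma_star(x[:, 0], x[:, 1]))   # should be close to rho = 0.7
```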

From (7) it follows that

$$\mathrm{ARE}(R(\hat\gamma^*), R) = 1 + \rho^2, \tag{8}$$

which shows that the improvement can be quite worthwhile. To put the result in (8) in its proper perspective, it is desirable to compare the performance of $R(\hat\gamma^*)$ to that of an efficient estimator such as the MLE. In fact, Johnson and Kotz (1972, p. 105) discuss the MLE for this case of known parameters for the marginal distributions. However, it turns out to be rather complicated, requiring the solution of a cubic equation. Therefore, Johnson and Kotz also give a first-order version which they advocate for many practical purposes.

Therefore, Johnson and Kotz also give a first-order version which they advocate for many practical purposes.

Now observe that the estimator corresponding to $\hat\gamma^*$ mentioned above satisfies

$$R(\hat\gamma^*) \doteq R(0) \Bigl\{ 1 - (\hat\gamma^*/2) \sum_{i=1}^2 (S_i^2 - 1) \Bigr\} = R(0) + \frac{R(0)}{1 + R^2(0)} \, (2 - S_1^2 - S_2^2). \tag{9}$$

But this latter expression precisely agrees with the aforementioned first-order version of the MLE from Johnson and Kotz. Hence, $R(\hat\gamma^*)$ not only improves on $R = R(1)$ as shown in (8), but moreover it is asymptotically efficient.

A final remark concerns the following. Application of MLEs automatically ensures that the estimators of $\rho$ do not fall outside the parameter space $[-1, 1]$. If desired, a similar restriction can be imposed on first-order versions like $R(\hat\gamma^*)$ or the one in (9).

3. Missing observations

After considering the limiting case in the previous section, we now address the general case, where the variances are not assumed to be known, but we have samples of size $m_i$ on $X_i$, $i = 1, 2$, in addition to the complete sample of size $n$ on $(X_1, X_2)$. Just as in Section 2, the feeling arises that $R$ from (2) should be improved upon, this time by utilizing the information from the $(m_1 + m_2)$ additional single observations. However, in this general case, even more care is needed in such attempts. For example, Johnson and Kotz (1972, p. 108) mention a possibility, but more or less discard it again by pointing out that its possible increase of accuracy must be judged against the possibility of obtaining values outside $[-1, 1]$.

We shall use the following setup and notation. Suppose we have $X_{ij}$ for $i = 1, 2$, $j = 1, \dots, n$, and in addition for $i = 1$, $j = n+1, \dots, n+m_1$ and for $i = 2$, $j = n+m_1+1, \dots, n+m_1+m_2$. Let

$$\sum_j = \sum_{j=1}^{n}, \qquad \sum_j^{(1)} = \sum_{j=n+1}^{n+m_1}, \qquad \sum_j^{(2)} = \sum_{j=n+m_1+1}^{n+m_1+m_2}, \qquad \lambda_i = \frac{n}{n+m_i}, \quad i = 1, 2. \tag{10}$$

For our purpose we are typically interested in constant $\lambda_i$ (apart from rounding effects). Hence, we shall suppose that $m_i$, $i = 1, 2$, and $n$ tend to infinity at a common rate. From Section 2 it is clear that we should not use the additional observations to simply replace the $S_i$, $i = 1, 2$, by new sample variances based on $(n + m_i)$ observations. For larger values of $m_i/n$, the resulting estimator would behave more and more like $R(0)$, which was shown to be an extremely bad choice.

Fortunately, Section 2 also suggests the right direction to take: we should mimic the behavior of $R(\gamma)$ from (6). In addition to $S_i^2 = n^{-1} \sum_j X_{ij}^2$, let $(S_i')^2 = m_i^{-1} \sum_j^{(i)} X_{ij}^2$ and $U_i = S_i^2 - 1$, $U_i' = (S_i')^2 - 1$, $i = 1, 2$. We replace (6) by

$$R(\gamma_1, \gamma_2) = S_{12} \big/ \bigl\{ (S_1')^{1-\gamma_1} S_1^{\gamma_1} (S_2')^{1-\gamma_2} S_2^{\gamma_2} \bigr\}. \tag{11}$$

By analogy to (3) we obtain that $R(\gamma_1, \gamma_2) \doteq \rho + V - \tfrac{1}{2} \rho \sum_{i=1}^2 \{ (1 - \gamma_i) U_i' + \gamma_i U_i \}$. To evaluate the variance of the limiting distribution of $n^{1/2}(R(\gamma_1, \gamma_2) - \rho)$, some results are needed in addition to (4). To begin with, we observe that $E V U_i' = E U_i U_i' = E U_1' U_2' = 0$. Moreover, $n E U_i^2 = 2$, $n E U_1 U_2 = 2\rho^2$ and $n E (U_i')^2 = 2n/m_i = 2\lambda_i/(1 - \lambda_i)$. Together with (11), this shows that

$$n \operatorname{var} R(\gamma_1, \gamma_2) \doteq (1 + \rho^2) - 2\rho^2 \sum_{i=1}^2 \gamma_i + \tfrac{1}{2} \rho^2 \Bigl\{ \sum_{i=1}^2 \gamma_i^2 + 2\rho^2 \gamma_1 \gamma_2 + \sum_{i=1}^2 (1 - \gamma_i)^2 \frac{\lambda_i}{1 - \lambda_i} \Bigr\}. \tag{12}$$

The expression in (12) is easily seen to be minimized for

$$\gamma_i^* = \frac{2 - \lambda_i - \rho^2 (1 - \lambda_i)(2 - \lambda_{3-i})}{1 - \rho^4 (1 - \lambda_1)(1 - \lambda_2)}, \tag{13}$$

$i = 1, 2$, whereas the corresponding minimal value satisfies

$$n \operatorname{var} R(\gamma_1^*, \gamma_2^*) \doteq \frac{(1 - \rho^2)^2 \{ 1 - \tfrac{1}{2} \rho^2 (2 - \lambda_1 - \lambda_2) \}}{1 - \rho^4 (1 - \lambda_1)(1 - \lambda_2)}. \tag{14}$$

Note that for the special case of equal numbers of additional observations, (13) and (14) simplify to

$$\gamma^* = \frac{2 - \lambda}{1 + \rho^2 (1 - \lambda)}, \qquad n \operatorname{var} R(\gamma^*, \gamma^*) \doteq \frac{(1 - \rho^2)^2}{1 + \rho^2 (1 - \lambda)}, \tag{15}$$

with $\lambda_1 = \lambda_2 = \lambda$ and $\gamma_1^* = \gamma_2^* = \gamma^*$. Also observe that the variance expressions in (14) and (15) produce the values $\tau^2 = (1 - \rho^2)^2$ and $(1 - \rho^2)^2/(1 + \rho^2)$ for $\lambda_i = 1$ and $\lambda_i = 0$, respectively. Hence, the result in (14) indeed fills the gap between the two extremes considered in Section 2. If only a small fraction of additional observations is available, the $\lambda_i$ are close to 1, the $\gamma_i^*$ from (13) are near one, and the variance of the limiting distribution is close to $\tau^2$. Likewise, for relatively large numbers of additional observations, the $\lambda_i$ are close to 0 and we approach the result for the asymptotic variance from (7).


To make the estimator applicable in practice, we replace $\gamma_i^*$ from (13) by $\hat\gamma_i^*$, obtained by substituting $R$ for $\rho$. The difference caused is again negligible and it follows that (8) generalizes to

$$\mathrm{ARE}(R(\hat\gamma_1^*, \hat\gamma_2^*), R) = \frac{1 - \rho^4 (1 - \lambda_1)(1 - \lambda_2)}{1 - \tfrac{1}{2} \rho^2 (2 - \lambda_1 - \lambda_2)}, \tag{16}$$

which reduces to $\mathrm{ARE}(R(\hat\gamma^*, \hat\gamma^*), R) = 1 + \rho^2 (1 - \lambda)$ for $\lambda = \lambda_1 = \lambda_2$. This surprisingly simple relation adequately makes clear the gain to be achieved by the optimal use of additional information.
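The full procedure of this section can be summarized in code. The Python sketch below (added for illustration; the function name and the data sizes are our own choices, and the data are standardized with known zero means as in the analysis above) implements (11) with the plug-in weights from (13):

```python
import numpy as np

def r_incomplete(x_pair, x1_extra, x2_extra):
    """Estimator R(gamma1*, gamma2*) of (11) with the plug-in weights of (13).
    x_pair: (n, 2) complete observations; xi_extra: additional Xi samples.
    Sketch for the standardized case (means 0, variances 1) of Section 3."""
    n = len(x_pair)
    m1, m2 = len(x1_extra), len(x2_extra)
    lam1, lam2 = n / (n + m1), n / (n + m2)

    s12 = np.mean(x_pair[:, 0] * x_pair[:, 1])                 # S12
    s_sq = [np.mean(x_pair[:, i] ** 2) for i in (0, 1)]        # S_i^2
    sp_sq = [np.mean(x1_extra ** 2), np.mean(x2_extra ** 2)]   # (S_i')^2

    r = s12 / np.sqrt(s_sq[0] * s_sq[1])    # R from (2), plug-in for rho
    denom = 1.0 - r ** 4 * (1 - lam1) * (1 - lam2)
    g1 = (2 - lam1 - r ** 2 * (1 - lam1) * (2 - lam2)) / denom  # (13), i = 1
    g2 = (2 - lam2 - r ** 2 * (1 - lam2) * (2 - lam1)) / denom  # (13), i = 2

    scale = (sp_sq[0] ** ((1 - g1) / 2) * s_sq[0] ** (g1 / 2)
             * sp_sq[1] ** ((1 - g2) / 2) * s_sq[1] ** (g2 / 2))
    return s12 / scale                       # (11)

rng = np.random.default_rng(2)
rho = 0.7
xy = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=500)
e1 = rng.normal(size=2000)   # extra X1 observations
e2 = rng.normal(size=2000)   # extra X2 observations
print(r_incomplete(xy, e1, e2))   # should be close to rho = 0.7
```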

4. Concluding remarks

In searching the literature for related approaches, we came across Hocking and Smith (1968) (see also Smith, 1968). Here initial estimators are based on the complete data, after which these estimators are improved stepwise by taking the subsequent groups of incomplete observations into account. At each step, the improvement is achieved by adding a suitable multiple of the difference between the estimator of a parameter based on the complete data and the estimator of the same parameter based on the group of incomplete data considered in this step. As $R(\gamma_1, \gamma_2)$ from (11) satisfies $R(\gamma_1, \gamma_2) \doteq R(1, 1) \bigl( 1 - \tfrac{1}{2} \sum_{i=1}^2 (1 - \gamma_i) \{ (S_i')^2 - S_i^2 \} \bigr)$, a strong resemblance to the method suggested by Hocking and Smith (1968) suggests itself. Indeed, explicit evaluation reveals that both methods are in fact equivalent to first order and, hence, in particular, produce the same efficiency figures. As the actual evaluation is tedious but straightforward, we refer to Teulings (1994) for details. In comparing the two methods, we note that the method we propose here is much easier to implement and to understand.
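The first-order equivalence just invoked is easy to check numerically. In the Python sketch below (our illustration; equal $\lambda$'s and the plug-in $\hat\gamma^*$ of (15) are used for simplicity) the product form of (11) and the one-step correction form displayed above are computed on the same data:

```python
import numpy as np

# Check (illustrative only) that the product-form estimator of (11) and its
# first-order expansion R(1,1)*(1 - 0.5*sum((1-g)*((S_i')^2 - S_i^2)))
# agree closely, as claimed in the comparison with Hocking and Smith.
rng = np.random.default_rng(3)
rho, n, m = 0.7, 2000, 2000
lam = n / (n + m)

xy = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
extra = [rng.normal(size=m), rng.normal(size=m)]

s12 = np.mean(xy[:, 0] * xy[:, 1])
s_sq = [np.mean(xy[:, i] ** 2) for i in (0, 1)]      # S_i^2
sp_sq = [np.mean(e ** 2) for e in extra]             # (S_i')^2

r = s12 / np.sqrt(s_sq[0] * s_sq[1])                 # R(1,1) = R from (2)
g = (2 - lam) / (1 + r ** 2 * (1 - lam))             # plug-in gamma* of (15)

exact = s12 / (sp_sq[0] ** ((1 - g) / 2) * s_sq[0] ** (g / 2)
               * sp_sq[1] ** ((1 - g) / 2) * s_sq[1] ** (g / 2))
first_order = r * (1 - 0.5 * sum((1 - g) * (sp_sq[i] - s_sq[i]) for i in (0, 1)))

print(exact, first_order)   # the two forms agree to first order
```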

A second topic we briefly mention is a small simulation study, performed to obtain an impression of how the first-order approximations work out for finite sample sizes. Again we refer to Teulings (1994) for full details. Of course, we now work again with $\hat R$, $\hat S_i^2$ and $\hat S_{12}$ from (1), rather than with $R$, $S_i^2$ and $S_{12}$ from (2). Likewise, we apply $(\hat S_i')^2 = (m_i - 1)^{-1} \sum_j^{(i)} (X_{ij} - \bar X_i')^2$, where $\bar X_i' = m_i^{-1} \sum_j^{(i)} X_{ij}$.

In Table 1 we present results for $\rho = 0.70$, $n = 20$ and $n = 100$ complete observations, and equal numbers $m_1 = m_2 = m$ of additional observations, running from $m = 10$ to $m = 2000$. Using 1000 simulations on each run, we obtain estimates $E\hat R$, $\widehat{\operatorname{var}} \hat R$, $E\hat R(\hat\gamma^*, \hat\gamma^*)$ and $\widehat{\operatorname{var}} \hat R(\hat\gamma^*, \hat\gamma^*)$ (where $\hat R(\hat\gamma^*, \hat\gamma^*)$ is obtained from $R(\hat\gamma^*, \hat\gamma^*)$ by replacing $S_i$ by $\hat S_i$ and $S_i'$ by $\hat S_i'$, $i = 1, 2$). Another reason for performing the simulations is to compare the present estimator to the one mentioned above, which was introduced by Hocking and Smith (1968). Hence in Table 1 we also list the values $E\hat R_{HS}$ and $\widehat{\operatorname{var}} \hat R_{HS}$ obtained for this estimator. Finally, the values based on the asymptotics to which we want to compare are $AV_1 = n^{-1} \tau^2 = (1 - \rho^2)^2/n$ and $AV_2 = (1 - \rho^2)^2 \{ 1 - \tfrac{1}{2} \rho^2 (2 - \lambda_1 - \lambda_2) \} / \{ [1 - \rho^4 (1 - \lambda_1)(1 - \lambda_2)] n \}$ (cf. (14)), respectively. These are given as well: $AV_1$ for comparison to $\widehat{\operatorname{var}} \hat R$, and $AV_2$ for both $\widehat{\operatorname{var}} \hat R(\hat\gamma^*, \hat\gamma^*)$ and $\widehat{\operatorname{var}} \hat R_{HS}$.

From Table 1 we observe that the value $n = 20$ is still quite small: the values of $\widehat{\operatorname{var}} \hat R$ do not get very close to $AV_1$. As a consequence, there also remains a bit of a gap between $\widehat{\operatorname{var}} \hat R(\hat\gamma^*, \hat\gamma^*)$ and $\widehat{\operatorname{var}} \hat R_{HS}$ on the one hand, and $AV_2$ on the other. For $n = 100$, the picture is much better: the values of $\widehat{\operatorname{var}} \hat R$ are quite close to $AV_1 = 0.0026$. Moreover, the behavior of the other two variances is now adequately approximated by $AV_2$. Also note that the figures for $n = 100$ nicely illustrate the result on the ARE in (16), according to which the efficiency values $AV_1/AV_2$ in this example are $1 + 0.49 m/(100 + m)$. Finally, observe that there is little difference in performance between $\hat R(\hat\gamma^*, \hat\gamma^*)$ and $\hat R_{HS}$. Hence, the simulations suggest no preference for either approach. Thus, we uphold the point of view expressed above, according to which the present method has the advantage of being simpler, both with respect to computing and to understanding.