Statistics and Its Interface Volume 4 (2011) 73–83
Aggregated estimating equation estimation
Nan Lin* and Ruibin Xi

*Corresponding author. This work was done when Ruibin Xi was a PhD student in the Department of Mathematics, Washington University in St. Louis.
Motivated by the recent active research on online analytical processing (OLAP), we develop a computation- and storage-efficient algorithm for estimating equation (EE) estimation in massive data sets using a "divide-and-conquer" strategy. In each partition of the data set, we compress the raw data into some low-dimensional statistics and then discard the raw data. We then obtain an approximation to the EE estimator, the aggregated EE (AEE) estimator, by solving an equation aggregated from the saved low-dimensional statistics in all partitions. These low-dimensional statistics are taken as the EE estimates and the first-order derivatives of the estimating equations in each partition. We show that, under proper partitioning and some regularity conditions, the AEE estimator is strongly consistent and asymptotically equivalent to the EE estimator. A major application of the AEE technique is to support fast OLAP of EE estimations for data warehousing technologies such as data cubes and data streams. It can also be used to reduce the computation time and to overcome the memory constraints posed by massive data sets. Simulation studies show that the AEE estimator provides efficient storage and a remarkable reduction in computation time, especially in its applications to data cubes and data streams.
Keywords and phrases: Massive data sets, Estimating equation, Data compression, Aggregation, Consistency, Asymptotic normality, Data cube.
1. INTRODUCTION
Two major challenges in analyzing massive data sets are storage and computational efficiency. In recent years, there has been active research on developing compression and aggregation schemes to support fast online analytical processing (OLAP) of various statistical analyses, such as linear regression [7, 14], general multiple linear regression [6, 19], logistic regression analysis [26], predictive filters [6], naive Bayesian classifiers [4] and linear discriminant analysis [22].
OLAP analysis is usually associated with data warehousing technologies such as data cubes [1, 12, 27] and data streams [16, 21], where fast responses to queries are often needed. The response time of any OLAP tool should be on the order of seconds, at most minutes, even if complex statistical analyses are involved.
Most current OLAP tools can only support simple analyses that are essentially linear operators [7, 6, 14, 19]. However, many advanced statistical analyses are nonlinear, and thus most current OLAP tools cannot be used to support them. In this paper, we develop a compression and aggregation strategy to support fast OLAP analysis for estimating equation (EE) estimators. The EE estimators form a very large family, and many statistical estimation techniques can be unified into the framework of EE estimators, including the ordinary least squares (OLS) estimator, the quasi-likelihood estimator (QLE) [25] and the robust M-estimator [17]. The scheme developed in this paper not only supports fast OLAP of EE estimation, but can also be used to reduce the computation time of EE estimates and to overcome the memory constraints imposed by massive data sets.
The compression and aggregation technique developed in this paper is based on the "divide-and-conquer" strategy. We first partition the massive data set into K subsets and then compress the raw data in each subset into the EE estimate and the first-order derivative of the estimating equation before discarding the raw data. The saved statistics allow reconstructing an approximation to the original estimating equation in each subset, and hence an approximation to the equation for the entire data set after aggregating over all subsets. We show in theory that the proposed aggregated EE (AEE) estimator is asymptotically equivalent to the EE estimator if the number of partitions K does not go to infinity too fast. Simulation studies validate the theory and show that the AEE estimator is computationally very efficient. Our results also show that the AEE estimator provides more accurate estimates than estimates from a subsample of the entire data set, which is a common approach for static massive data sets.
The remainder of the paper is organized as follows. We first review the regression cube [6] in Section 2 and then present the AEE estimator in Section 3, with its asymptotic properties given in Section 4. In Section 5, we study the application of the AEE estimator to the QLE and provide asymptotic properties for the resulting aggregated QLE. Sections 6 and 7 study the performance of the AEE estimator and its applications to data cubes and data streams through simulation studies. Finally, Section 8 concludes the paper and provides some discussion. All proofs are given in the Appendix.

2. AGGREGATION FOR LINEAR REGRESSION

In this section, we review the regression cube technique [6] to illustrate the idea of aggregation for linear regression analysis.

Suppose that we have $N$ independent observations $(y_1, x_1), \ldots, (y_N, x_N)$, where $y_i$ is a scalar response and $x_i$ is a $p \times 1$ covariate vector, $i = 1, \ldots, N$. Let $y = (y_1, \ldots, y_N)^T$ and $X = (x_1, \ldots, x_N)^T$. A linear regression model assumes that $E(y) = X\beta$. Suppose that $X^T X$ is invertible; then the OLS estimator of $\beta$ is $\hat{\beta}_N = (X^T X)^{-1} X^T y$. Suppose that the entire data set is partitioned into $K$ subsets with $y_k$ and $X_k$ being the values of the response and covariates, and $\hat{\beta}_k = (X_k^T X_k)^{-1} X_k^T y_k$ the OLS estimate in the $k$th subset, $k = 1, \ldots, K$. Then we have $y = (y_1^T, \ldots, y_K^T)^T$ and $X = (X_1^T, \ldots, X_K^T)^T$. Since $X^T X = \sum_{k=1}^K X_k^T X_k$ and $X^T y = \sum_{k=1}^K X_k^T y_k$, the regression cube technique sees that

(1)   $\hat{\beta}_N = (X^T X)^{-1} X^T y = \left( \sum_{k=1}^K X_k^T X_k \right)^{-1} \sum_{k=1}^K X_k^T X_k \hat{\beta}_k,$

which suggests that we can compute the OLS estimate for the entire data set without accessing the raw data, after saving $(X_k^T X_k, \hat{\beta}_k)$ for each subset. The size of $(X_k^T X_k, \hat{\beta}_k)$ is $p^2 + p$, so we only need to save $Kp(p + 1)$ numbers, which achieves very efficient compression since both $K$ and $p$ are far less than $N$ in practice. The success of this technique owes to the linearity of the estimating equation in the parameter $\beta$: the estimating equation of the entire data set is a simple summation of the equations in all subsets. That is,

$X^T (y - X\beta) = \sum_{k=1}^K X_k^T (y_k - X_k \beta) = 0.$
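To make the aggregation in (1) concrete, the following minimal sketch (in Python with NumPy, our illustration language; the paper's own programs are written in C) compresses each subset into $(X_k^T X_k, \hat{\beta}_k)$, discards the raw data, and recovers the full-data OLS estimate exactly. All variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, K = 100_000, 5, 10
X = rng.normal(size=(N, p))
y = X @ np.arange(1.0, p + 1.0) + rng.normal(size=N)

# Compression: keep only (X_k^T X_k, beta_k) per subset, discard raw data.
compressed = []
for Xk, yk in zip(np.array_split(X, K), np.array_split(y, K)):
    XtX_k = Xk.T @ Xk
    beta_k = np.linalg.solve(XtX_k, Xk.T @ yk)   # subset OLS estimate
    compressed.append((XtX_k, beta_k))

# Aggregation: formula (1) reassembles the full-data OLS estimate.
S = sum(XtX for XtX, _ in compressed)
b = sum(XtX @ beta for XtX, beta in compressed)
beta_agg = np.linalg.solve(S, b)

beta_full = np.linalg.solve(X.T @ X, X.T @ y)    # direct OLS for comparison
assert np.allclose(beta_agg, beta_full)          # identity is exact for OLS
```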
3. THE AEE ESTIMATOR
In this section, we consider, more generally, estimating equation estimation in massive data sets and propose our AEE estimator to provide a computationally tractable estimator by approximation and aggregation.

Given independent observations $\{z_i, i = 1, \ldots, N\}$, suppose that there exists $\beta_0 \in R^p$ such that $\sum_{i=1}^N E[\psi(z_i, \beta_0)] = 0$ for some score function $\psi$. The score function is in general a vector function of the same dimension $p$ as the parameter. The EE estimator $\hat{\beta}_N$ of $\beta_0$ is defined as the solution to the estimating equation $\sum_{i=1}^N \psi(z_i, \beta) = 0$. In regression analysis, we have $z_i = (y_i, x_i^T)$ with response variable $y$ and predictor $x$, and the score function is usually given as $\psi(z, \beta) = \phi(y - x^T \beta) x$ for some function $\phi$. When $\phi$ is the identity function, the estimating equation is linear in $\beta$ and the resulting estimator is the OLS estimator. However, the score function $\psi$ is more often nonlinear, and this nonlinearity makes it difficult to find low-dimensional summary statistics from which the EE estimate for the entire data set can be obtained by aggregation as in (1). Therefore, we adjust our aim to finding an estimator that accurately approximates the EE estimator and can still be computed by aggregation. Our basic idea is to approximate the nonlinear estimating equation by its first-order Taylor approximation, whose linearity then allows us to find representations similar to (1) and hence the proper low-dimensional summary statistics.

Again, consider partitioning the entire data set into $K$ subsets. To simplify our notation, we assume that all subsets are of equal size $n$; this condition is not necessary for our theory, though. Denote the observations in the $k$th subset by $z_{k1}, \ldots, z_{kn}$. The EE estimate $\hat{\beta}_{nk}$ based on the observations in the $k$th subset is then the solution to the following estimating equation,

(2)   $M_k(\beta) = \sum_{i=1}^n \psi(z_{ki}, \beta) = 0.$

Let

(3)   $A_k = -\sum_{i=1}^n \frac{\partial \psi(z_{ki}, \hat{\beta}_{nk})}{\partial \beta}.$

Since $M_k(\hat{\beta}_{nk}) = 0$, we have $M_k(\beta) = -A_k(\beta - \hat{\beta}_{nk}) + R_2 = F_k(\beta) + R_2$ from the Taylor expansion of $M_k(\beta)$ at $\hat{\beta}_{nk}$, where $R_2$ is the residual term of the Taylor expansion. The AEE estimator $\tilde{\beta}_{NK}$ is then the solution to $F(\beta) = \sum_{k=1}^K F_k(\beta) = 0$, which leads to

(4)   $\tilde{\beta}_{NK} = \left( \sum_{k=1}^K A_k \right)^{-1} \sum_{k=1}^K A_k \hat{\beta}_{nk}.$
This representation suggests the following algorithm to compute the AEE estimator.

1. Partition. Partition the entire data set into $K$ subsets, each small enough to fit in the computer's memory.
2. Compression. For the $k$th subset, save $(\hat{\beta}_{nk}, A_k)$ and discard the raw data. Repeat for $k = 1, \ldots, K$.
3. Aggregation. Calculate the AEE estimator $\tilde{\beta}_{NK}$ using (4).

This implementation makes it feasible to process massive data sets on regular computers as long as each partition is manageable for the computer. It also provides a very efficient storage solution because only $K(p^2 + p)$ numbers need to be stored after compressing the data.
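In code, the three steps amount to a short loop once a per-subset root-finder is available. Below is a minimal sketch in Python/NumPy assuming two hypothetical callables, `solve_ee` (any solver for the subset estimating equation (2)) and `neg_jacobian` (returning $A_k$ as in (3)); neither name comes from the paper.

```python
import numpy as np

def aee(subsets, solve_ee, neg_jacobian):
    """Divide-and-conquer AEE estimator, formula (4).

    subsets               : iterable of per-partition data arrays
    solve_ee(Z)           -> beta_hat_nk solving sum_i psi(z_i, beta) = 0
    neg_jacobian(Z, beta) -> A_k = -sum_i d psi(z_i, beta)/d beta  (p x p)
    """
    saved = []
    for Z in subsets:                   # 1. Partition: subsets arrive one at a time
        beta_k = solve_ee(Z)            # 2. Compression: per-subset EE estimate...
        A_k = neg_jacobian(Z, beta_k)   #    ...and first-order derivative A_k
        saved.append((beta_k, A_k))     #    the raw subset Z can now be discarded
    S = sum(A for _, A in saved)        # 3. Aggregation: solve
    b = sum(A @ beta for beta, A in saved)  # (sum_k A_k) beta = sum_k A_k beta_hat_nk
    return np.linalg.solve(S, b)
```

For OLS this reduces exactly to (1), since there $A_k = X_k^T X_k$.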
4. ASYMPTOTIC PROPERTIES
In this section, we give the consistency of the AEE estimator. Theorem 4.1 gives the strong consistency of the AEE estimator for finite $K$. Theorem 4.2 further shows that when $K$ goes to infinity not too fast, the AEE estimator is a consistent estimator under some regularity conditions. Theorem 4.2 is very useful for proving the asymptotic equivalence of the AEE estimator and the EE estimator. In the next section, we apply Theorem 4.2 to the aggregated quasi-likelihood estimator (QLE) and show its asymptotic equivalence to the original QLE. Let the score function be $\psi(z_i, \beta) = (\psi_1(z_i, \beta), \ldots, \psi_p(z_i, \beta))^T$. We first specify some technical conditions.

(C1) The score function $\psi$ is measurable for any fixed $\beta$ and is twice continuously differentiable with respect to $\beta$.

(C2) The matrix $-\partial \psi(z_i, \beta)/\partial \beta$ is semi-positive definite (s.p.d.), and $-\sum_{i=1}^n \partial \psi(z_i, \beta)/\partial \beta$ is positive definite (p.d.) in a neighborhood of $\beta_0$ when $n$ is large enough.

(C3) The EE estimator $\hat{\beta}_n$ is strongly consistent, i.e., $\hat{\beta}_n \to \beta_0$ almost surely (a.s.) as $n \to \infty$.

(C4) There exist two p.d. matrices $\Lambda_1$ and $\Lambda_2$ such that $\Lambda_1 \le n^{-1} A_k \le \Lambda_2$ for all $k = 1, \ldots, K$, i.e., for any $v \in R^p$, $v^T \Lambda_1 v \le n^{-1} v^T A_k v \le v^T \Lambda_2 v$, where $A_k$ is given in (3).

(C5) In a neighborhood of $\beta_0$, the norm of the second-order derivative $\partial^2 \psi_j(z_i, \beta)/\partial \beta^2$ is uniformly bounded, i.e., $\|\partial^2 \psi_j(z_i, \beta)/\partial \beta^2\| \le C_2$ for all $i, j$, where $C_2$ is a constant.

(C6) There exists a real number $\alpha \in (1/4, 1/2)$ such that for any $\eta > 0$, the EE estimator $\hat{\beta}_n$ satisfies $P(n^{\alpha} \|\hat{\beta}_n - \beta_0\| \ge \eta) \le C_\eta n^{2\alpha - 1}$, where $C_\eta > 0$ is a constant depending only on $\eta$.

Under Condition (C2), the matrices $A_k$ are positive definite in probability and therefore the AEE estimator $\tilde{\beta}_{NK}$ is well-defined in probability. Condition (C3) is necessary for the strong consistency of the AEE estimator and is satisfied by almost all EE estimators in practice. Conditions (C4) and (C5) are required to prove the strong consistency of the AEE estimator, and are often true when each subset contains enough observations. Condition (C6) is useful in showing the consistency of the AEE estimator and the asymptotic equivalence of the AEE and EE estimators when the partition number $K$ also goes to infinity as the number of observations goes to infinity. In Section 5, we will show that Condition (C6) is satisfied for the quasi-likelihood estimators considered in [5] under some regularity conditions.
Theorem 4.1. Let $k_0 = \arg\max_{1 \le k \le K} \{\|\hat{\beta}_{nk} - \beta_0\|\}$. Under Conditions (C1)–(C3), if the partition number $K$ is bounded, we have $\|\tilde{\beta}_{NK} - \beta_0\| \le K \|\hat{\beta}_{nk_0} - \beta_0\|$. If Condition (C4) is also true, we have $\|\tilde{\beta}_{NK} - \beta_0\| \le C \|\hat{\beta}_{nk_0} - \beta_0\|$ for some constant $C$ independent of $n$ and $K$. Furthermore, if Condition (C5) is satisfied, we have $\|\tilde{\beta}_{NK} - \hat{\beta}_N\| \le C_1 (\|\hat{\beta}_{nk_0} - \beta_0\|^2 + \|\hat{\beta}_N - \beta_0\|^2)$ for some constant $C_1$ independent of $n$ and $K$.
Theorem 4.1 shows that if the partition number $K$ is bounded, then the AEE estimator is also strongly consistent. Usually, we have $\|\hat{\beta}_N - \beta_0\| = o(\|\hat{\beta}_{nk_0} - \beta_0\|)$. Therefore, the last part of Theorem 4.1, combined with the triangle inequality, implies that $\|\tilde{\beta}_{NK} - \beta_0\| \le 2C_1 \|\hat{\beta}_{nk_0} - \beta_0\|^2 + \|\hat{\beta}_N - \beta_0\|$.
Theorem 4.2. Let $\hat{\beta}_N$ be the EE estimator based on the entire data set. Then under Conditions (C1)–(C2) and (C4)–(C6), if the partition number $K$ satisfies $K = O(n^{\gamma})$ for some $0 < \gamma < \min\{1 - 2\alpha, 4\alpha - 1\}$, we have $P(\sqrt{N} \|\tilde{\beta}_{NK} - \hat{\beta}_N\| > \delta) = o(1)$ for any $\delta > 0$.
Theorem 4.2 tells us that if the EE estimator $\hat{\beta}_N$ is a consistent estimator and the partition number $K$ goes to infinity slowly enough, then the AEE estimator $\tilde{\beta}_{NK}$ is also a consistent estimator. In general, one can easily use Theorem 4.2 to show the asymptotic normality of the AEE estimator if the EE estimator is asymptotically normally distributed, and further to prove the asymptotic equivalence of the two estimators.
5. THE AGGREGATED QLE
In this section, we demonstrate the applicability of the AEE technique to quasi-likelihood estimation and call the resulting estimator the aggregated quasi-likelihood estimator (AQLE). We consider a simplified version of the QLE discussed in [5]. Suppose that we have $N$ independent observations $(y_i, x_i)$, $i = 1, \ldots, N$, where $y$ is a scalar response and $x$ is a $p$-dimensional vector of explanatory variables. Let $\mu$ be a continuously differentiable function such that $\dot{\mu}(t) = d\mu/dt > 0$ for all $t$. Suppose that we have

(5)   $E(y_i) = \mu(\beta_0^T x_i), \quad i = 1, \ldots, N,$

for some $\beta_0 \in R^p$. Then the QLE of $\beta_0$, $\hat{\beta}_N$, is the solution to the estimating equation

(6)   $Q(\beta) = \sum_{i=1}^N [y_i - \mu(\beta^T x_i)] x_i = 0.$
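For the score in (6), $\psi((y, x), \beta) = [y - \mu(\beta^T x)] x$ and $A_k = \sum_{i=1}^n \dot{\mu}(\hat{\beta}_{nk}^T x_{ki}) x_{ki} x_{ki}^T$, so a single Newton–Raphson pass per subset yields both statistics saved in the compression step. The sketch below is our illustration in Python/NumPy, not the authors' implementation; the step-halving safeguard is our addition.

```python
import numpy as np

def qle_newton(X, y, mu, mu_dot, tol=1e-10, max_iter=100):
    """Solve Q(beta) = sum_i [y_i - mu(beta^T x_i)] x_i = 0 by Newton-Raphson.

    Returns (beta_hat, A) with A = sum_i mu_dot(beta_hat^T x_i) x_i x_i^T,
    exactly the pair (beta_hat_nk, A_k) saved in the compression step.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        score = X.T @ (y - mu(X @ beta))            # Q(beta)
        A = (X * mu_dot(X @ beta)[:, None]).T @ X   # -dQ/dbeta, positive definite
        step = np.linalg.solve(A, score)
        for _ in range(30):                         # step-halving safeguard
            q_new = X.T @ (y - mu(X @ (beta + step)))
            if np.all(np.isfinite(q_new)) and \
               np.linalg.norm(q_new) <= np.linalg.norm(score):
                break
            step /= 2.0
        beta = beta + step
        if np.linalg.norm(step) < tol:
            break
    A = (X * mu_dot(X @ beta)[:, None]).T @ X       # A_k evaluated at beta_hat
    return beta, A
```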
Let $\varepsilon_i = y_i - \mu(\beta_0^T x_i)$ and $\sigma_i^2 = \mathrm{Var}(y_i)$. The following theorem shows that Condition (C6) is satisfied for the QLE under some regularity conditions.
Theorem 5.1. Consider a generalized linear model specified by (5) with fixed design. Suppose that the $y_i$'s are independent and that $\lambda_N$ is the minimum eigenvalue of $\sum_{i=1}^N x_i x_i^T$. If there are two positive constants $C$ and $M$ such that $\lambda_N / N > C$ and $\sup_i \{\|x_i\|, \sigma_i^2\} \le M$, then for any $\eta > 0$ and $\alpha \in (0, 1/2)$,

$P(N^{\alpha} \|\hat{\beta}_N - \beta_0\| \ge \eta) \le C_1 (m_\eta \eta)^{-2} N^{2\alpha - 1},$

where $C_1 = pM^3 C^{-3}$ is a constant and $m_\eta > 0$ is a constant depending only on $\eta$.
Now suppose that the entire data set is partitioned into $K$ subsets. Let $\{(y_{ki}, x_{ki})\}_{i=1}^n$ be the observations in the $k$th subset with $n = N/K$.

(B1) The link function $\mu$ is twice continuously differentiable and the derivative of the link function is always positive, i.e., $\dot{\mu}(t) > 0$.

(B2) The vectors $x_{ki}$ are fixed and uniformly bounded, and the minimum eigenvalue $\lambda_k$ of $\sum_{j=1}^n x_{kj} x_{kj}^T$ satisfies $\lambda_k / n > C > 0$ for all $k$ and $n$.

(B3) The variances $\sigma_{ki}^2$ of $y_{ki}$ are uniformly bounded.
Condition (B1) is needed for Conditions (C1) and (C5). Conditions (B1)–(B2) together guarantee Conditions (C2), (C4) and (C5). It is easy to verify that all the conditions assumed in Theorem 1 of [5] are satisfied under Conditions (B1)–(B2). Hence, by Theorem 1 in [5], the QLEs $\hat{\beta}_{nk}$ are strongly consistent. Theorem 5.1 implies that the QLEs $\hat{\beta}_{nk}$ satisfy Condition (C6) under Conditions (B1)–(B3). Therefore, the conclusions in Theorems 4.1 and 4.2 hold for the AQLE under Conditions (B1)–(B3). Furthermore, the AQLE $\tilde{\beta}_{NK}$ has the following asymptotic normality.
Theorem 5.2. Let $\Sigma_N = \sum_{i=1}^N \sigma_i^2 x_i x_i^T$ and $D_N(\beta) = \sum_{i=1}^N \dot{\mu}(x_i^T \beta) x_i x_i^T$. Suppose that there exists a constant $c_1$ such that $\sigma_i^2 > c_1^2$ for all $i$ and that $\sup_i E(|\varepsilon_i|^r) < \infty$ for some $r > 2$. Then under Conditions (B1)–(B3), if $K = O(n^{\gamma})$ for some $0 < \gamma < \min\{1 - 2\alpha, 4\alpha - 1\}$, we have $\Sigma_N^{-1/2} D_N(\beta_0) (\tilde{\beta}_{NK} - \beta_0) \overset{d}{\longrightarrow} N(0, I_p)$, and $\tilde{\beta}_{NK}$ is asymptotically equivalent to the QLE $\hat{\beta}_N$.
6. SIMULATION STUDIES AND REAL DATA ANALYSIS
6.1 Simulation
In this section, we illustrate the computational advantages of the AEE estimator by simulation studies. We consider computing the maximum likelihood estimator (MLE) of the regression coefficients in logistic regression with five predictors $x_1, \ldots, x_5$. Let $y_i$ be the binary response and $x_i = (1, x_{i1}, \ldots, x_{i5})^T$. In a logistic regression model, we have

$\Pr(y_i = 1) = \mu(x_i^T \beta) = \frac{\exp(x_i^T \beta)}{1 + \exp(x_i^T \beta)}, \quad i = 1, \ldots, N,$

and the MLE of the regression coefficients $\beta$ is a special case of the QLE discussed in Section 5. We set the true regression coefficients as $\beta = (\beta_0, \beta_1, \ldots, \beta_5) = (1, 2, 3, 4, 5, 6)$ and the sample size as $N$ = 500,000. The predictor values are drawn independently from the standard normal distribution.

We then compute $\tilde{\beta}_{NK}$, the AEE estimate of $\beta$, for partition numbers $K$ = 1,000, 950, ..., 100, 90, ..., 10. In compressing the subsets, we use the Newton–Raphson method to calculate the MLE $\hat{\beta}_{nk}$ in every subset $k$, $k = 1, \ldots, K$. For comparison, we also compute $\hat{\beta}_N$, the MLE from the entire data set, which is equivalent to $\tilde{\beta}_{NK}$ when $K = 1$. All programs are written in C and our computer has a 1.6 GHz Pentium processor and 512 MB memory.
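This experiment is easy to reproduce in outline. The sketch below (in Python/NumPy rather than the C used for the paper's timings) reuses the hypothetical `qle_newton` solver sketched in Section 5 with the logistic mean function, whose derivative is $\dot{\mu}(t) = \mu(t)(1 - \mu(t))$; $K = 100$ is one illustrative value from the range above.

```python
import numpy as np

def mu(t):          # logistic mean function
    return 1.0 / (1.0 + np.exp(-t))

def mu_dot(t):      # its derivative, the binomial variance function
    m = mu(t)
    return m * (1.0 - m)

rng = np.random.default_rng(1)
N, K = 500_000, 100
beta_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X = np.column_stack([np.ones(N), rng.normal(size=(N, 5))])
y = rng.binomial(1, mu(X @ beta_true))

# Compression: Newton-Raphson MLE and A_k in each of the K subsets.
saved = [qle_newton(Xk, yk, mu, mu_dot)
         for Xk, yk in zip(np.array_split(X, K), np.array_split(y, K))]

# Aggregation by formula (4), then the relative bias plotted in Figure 1.
S = sum(A for _, A in saved)
beta_aee = np.linalg.solve(S, sum(A @ b for b, A in saved))
print(np.linalg.norm(beta_aee - beta_true) / np.linalg.norm(beta_true))
```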
Figure 1 plots the relative bias $\|\tilde{\beta}_{NK} - \beta_0\| / \|\beta_0\|$ against the number of partitions $K$.

Figure 1. Relative bias against number of partitions.

The linearly increasing trend can be well explained by our theory. In Section 4, we argued that the magnitude of $\|\tilde{\beta}_{NK} - \beta_0\|$ is close to $2C_1 \|\hat{\beta}_{nk_0} - \beta_0\|^2 + \|\hat{\beta}_N - \beta_0\|$. From Theorem 1 in [5], we have $\|\hat{\beta}_{nk_0} - \beta_0\|^2 = o([\log n]^{1+\delta} / n)$. Since $\log n \ll n$, $\|\tilde{\beta}_{NK} - \beta_0\|$ is close to $o(1/n) = o(K/N)$, which increases linearly with $K$ when $N$ is held fixed. Since $N$ is fixed, $\|\hat{\beta}_N - \beta_0\|$ is fixed, and so $\|\tilde{\beta}_{NK} - \beta_0\|$ will roughly increase linearly with $K$.
Figure 2 plots the computation time against the number of partitions. It takes 290 seconds to compute the MLE ($K = 1$) and 128 seconds to compute the AEE estimator when $K = 10$, a reduction in computation time of more than 50%. As $K$ increases, the computation time soon stabilizes. This shows that we may choose a relatively small $K$ as long as the size of each subset does not exceed the storage limit or memory constraint. On the other hand, we see that the AEE estimator provides not only an efficient storage solution, but also a viable way to achieve more efficient computation even when the EE estimate using all the raw data can be computed.
Next, we show that the AEE estimator is more accurate than estimates based on sub-sampling. In our study, we can view each $\hat{\beta}_{nk}$ as an estimate based on a sub-sample of the entire data set. Table 1 presents the percentages of the $\hat{\beta}_{nk}$'s whose relative bias $\|\hat{\beta}_{nk} - \beta_0\| / \|\beta_0\|$ is above that of the AEE estimator for different partition numbers. More than 90% of the $\hat{\beta}_{nk}$'s have a relative bias larger than that of $\tilde{\beta}_{NK}$, which clearly shows that the AEE estimator is more accurate than estimators based on sub-sampling.

Figure 2. Computation time against number of partitions K.

Table 1. Performance of $\hat{\beta}_{nk}$

K            500    100    50     10
Percentage   94%    97%    94%    90%
6.2 Real data analysis
In this section, we apply our aggregation technique to a real data set. In [8], Chiang et al. used next-generation sequencing (NGS) to detect copy number variation in the sample genome. It is known that current NGS platforms have various biases [9]; for example, GC-bias can lead to an uneven distribution of short reads on the genome. Another important factor that can influence the read distribution is the mappability [23] of genomic positions. Specifically, due to the existence of segmental duplications and repeat sequences, a short sequence (e.g., a 35 bp short sequence) starting from a genomic position may have many copies in the reference genome, making this genomic position not uniquely mappable. Hence, variation of the mappability across the reference genome will also lead to an uneven distribution of uniquely mapped reads. Here, we are interested in how the number of reads in a certain genomic window is related to factors like GC-content and mappability.
We use the sequencing data of the matched normal genome of the cell line H2347 in [8] to study how the number of reads relates to other factors. We first binned the uniquely mapped reads into 1,000 bp bins and counted the number of reads in each bin. Then, for each bin, we counted how many nucleotides are G, C and A; since the bin size is known, the G, C and A counts also determine how many nucleotides are T. For each bin, we also counted how many genomic positions are uniquely mappable (35 bp short sequence). Assume that the number of reads in the $i$th bin follows a Poisson distribution with parameter $\lambda_i$. We consider the following model,

$\log(\lambda_i) = \beta_0 + \beta_1 \log(G) + \beta_2 \log(C) + \beta_3 \log(A) + \beta_4 \log(M),$

where $G$, $C$ and $A$ are the G, C and A counts in the $i$th bin, and $M$ is the proportion of uniquely mappable positions. To avoid taking the logarithm of zero, we added a small number to the G, C and A counts (0.1) and to the mappability (0.0001).
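To illustrate this preprocessing and the fit, here is a sketch on simulated bins; the Poisson log-linear model has $\mu(t) = \dot{\mu}(t) = e^t$, so the hypothetical `qle_newton` solver from Section 5 applies unchanged. The counts, coefficients and partition number below are toy stand-ins, not values from the study.

```python
import numpy as np

rng = np.random.default_rng(2)
bins = 50_000                                      # simulated 1,000 bp bins
g, c, a = rng.integers(150, 350, size=(3, bins))   # toy G, C, A counts per bin
mappable = rng.uniform(0.0, 1.0, size=bins)        # toy mappability proportions

# Pseudocounts (0.1 on nucleotide counts, 0.0001 on mappability) avoid log(0).
X = np.column_stack([np.ones(bins),
                     np.log(g + 0.1), np.log(c + 0.1), np.log(a + 0.1),
                     np.log(mappable + 1e-4)])
beta_toy = np.array([1.0, 0.3, 0.3, 0.1, 0.5])     # arbitrary toy coefficients
y = rng.poisson(np.exp(X @ beta_toy))              # Poisson read counts per bin

# Subsets of 5,000 bins each; mu(t) = mu_dot(t) = exp(t) for Poisson regression.
saved = [qle_newton(Xk, yk, np.exp, np.exp)
         for Xk, yk in zip(np.array_split(X, 10), np.array_split(y, 10))]
S = sum(A for _, A in saved)
beta_aee = np.linalg.solve(S, sum(A @ b for b, A in saved))
```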
Then, for each chromosome (chromosomes 1, ..., 22 and X), we compared the MLE $\hat{\beta}$ with its corresponding AEE estimate $\tilde{\beta}$ of the Poisson regression model. To calculate the AEE estimate, for each chromosome, we partitioned the data set into $K$ subsets such that each subset had 5,000 data points (except possibly one subset). Figure 3 shows the number of subsets $K$ used for each chromosome. Then, for each chromosome, we calculated the relative bias $\|\tilde{\beta} - \hat{\beta}\| / \|\hat{\beta}\|$ (Figure 4). From Figure 4, we see that the MLE and its corresponding AEE estimates are very close, showing that our aggregation performs well on this data set.

Figure 3. The number of subsets K used for each chromosome.
7. APPLICATIONS: DATA CUBES AND DATA STREAMS
In this section, we discuss applications of the AEE estimator in two massive data environments: data cubes and data streams. Analyses in both environments require performing the same analysis for different subsets while the raw data often cannot be saved permanently. Efficient compression

References

Aggarwal, C. C., Han, J., Wang, J. and Yu, P. S. (2003). A framework for clustering evolving data streams. In Proceedings of the 29th International Conference on Very Large Data Bases (VLDB).

Gray, J., Chaudhuri, S., Bosworth, A., Layman, A., Reichart, D., Venkatrao, M., Pellow, F. and Pirahesh, H. (1997). Data cube: a relational aggregation operator generalizing GROUP-BY, CROSS-TAB, and SUB-TOTALS. Data Mining and Knowledge Discovery 1 29–53.

Rao, C. R. (1973). Linear Statistical Inference and Its Applications, 2nd ed. Wiley, New York.

Wedderburn, R. W. M. (1974). Quasi-likelihood functions, generalized linear models, and the Gauss–Newton method. Biometrika 61 439–447.