Statistics and Its Interface Volume 4 (2011) 73–83
Aggregated estimating equation estimation
Nan Lin* and Ruibin Xi

*Corresponding author. This work was done when Ruibin Xi was a PhD student in the Department of Mathematics, Washington University in St. Louis.
Motivated by the recent active research on online analytical processing (OLAP), we develop a computation- and storage-efficient algorithm for estimating equation (EE) estimation in massive data sets using a "divide-and-conquer" strategy. In each partition of the data set, we compress the raw data into some low-dimensional statistics and then discard the raw data. We then obtain an approximation to the EE estimator, the aggregated EE (AEE) estimator, by solving an equation aggregated from the saved low-dimensional statistics in all partitions. These low-dimensional statistics are taken as the EE estimates and the first-order derivatives of the estimating equations in each partition. We show that, under proper partitioning and some regularity conditions, the AEE estimator is strongly consistent and asymptotically equivalent to the EE estimator. A major application of the AEE technique is to support fast OLAP of EE estimations for data warehousing technologies such as data cubes and data streams. It can also be used to reduce the computation time and to overcome the memory constraints posed by massive data sets. Simulation studies show that the AEE estimator provides efficient storage and a remarkable reduction in computation time, especially in its applications to data cubes and data streams.
Keywords and phrases: Massive data sets, Estimating equation, Data compression, Aggregation, Consistency, Asymptotic normality, Data cube.
1. INTRODUCTION
Two major challenges in analyzing massive data sets are storage and computational efficiency. In recent years, there has been active research on developing compression and aggregation schemes to support fast online analytical processing (OLAP) of various statistical analyses, such as linear regression [7, 14], general multiple linear regression [6, 19], logistic regression analysis [26], predictive filters [6], naive Bayesian classifiers [4] and linear discriminant analysis [22].
OLAP analysis is usually associated with data warehousing technologies such as data cubes [1, 12, 27] and data streams [16, 21], where fast responses to queries are often needed. The response time of any OLAP tool should be on the order of seconds, at most minutes, even if complex statistical analyses are involved.
Most current OLAP tools can only support simple analyses that are essentially linear operators [7, 6, 14, 19]. However, many advanced statistical analyses are nonlinear, and thus most current OLAP tools cannot be used to support them. In this paper, we develop a compression and aggregation strategy to support fast OLAP analysis for estimating equation (EE) estimators. The EE estimators form a very large family, and many statistical estimation techniques can be unified into the framework of EE estimators, including the ordinary least squares (OLS) estimator, the quasi-likelihood estimator (QLE) [25] and the robust M-estimator [17]. The scheme developed in this paper not only supports fast OLAP of EE estimation, but can also be used to reduce the computation time of EE estimates and to overcome the memory constraints imposed by massive data sets.
The compression and aggregation technique developed in this paper is based on the "divide-and-conquer" strategy. We first partition the massive data set into K subsets and then compress the raw data in each subset into the EE estimate and the first-order derivative of the estimating equation before discarding the raw data. The saved statistics allow reconstructing an approximation to the original estimating equation in each subset, and hence an approximation to the equation for the entire data set after aggregating over all subsets. We show in theory that the proposed aggregated EE (AEE) estimator is asymptotically equivalent to the EE estimator if the number of partitions K does not go to infinity too fast. Simulation studies validate the theory and show that the AEE estimator is computationally very efficient. Our results also show that the AEE estimator provides more accurate estimates than estimates from a subsample of the entire data set, which is a common approach for static massive data sets.
The remainder of the paper is organized as follows. We first review the regression cube [6] in Section 2 and then present the AEE estimator in Section 3, with its asymptotic properties given in Section 4. In Section 5, we study the application of the AEE estimator to the QLE and provide asymptotic properties for the resulting aggregated QLE. Sections 6 and 7 study the performance of the AEE estimator and its applications to data cubes and data streams through simulation studies. Finally, Section 8 concludes the paper and provides some discussion. All proofs are given in the Appendix.

2. AGGREGATION FOR LINEAR REGRESSION

In this section, we review the regression cube technique [6] to illustrate the idea of aggregation for linear regression analysis.

Suppose that we have $N$ independent observations $(y_1, x_1), \ldots, (y_N, x_N)$, where $y_i$ is a scalar response and $x_i$ is a $p \times 1$ covariate vector, $i = 1, \ldots, N$. Let $y = (y_1, \ldots, y_N)^T$ and $X = (x_1, \ldots, x_N)^T$. A linear regression model assumes that $E(y) = X\beta$. Suppose that $X^T X$ is invertible; then the OLS estimator of $\beta$ is $\hat{\beta}_N = (X^T X)^{-1} X^T y$. Suppose that the entire data set is partitioned into $K$ subsets with $y_k$ and $X_k$ being the values of the response and covariates, and $\hat{\beta}_k = (X_k^T X_k)^{-1} X_k^T y_k$ the OLS estimate in the $k$th subset, $k = 1, \ldots, K$. Then we have $y = (y_1^T, \ldots, y_K^T)^T$ and $X = (X_1^T, \ldots, X_K^T)^T$. Since $X^T X = \sum_{k=1}^K X_k^T X_k$ and $X^T y = \sum_{k=1}^K X_k^T y_k$, the regression cube technique sees that

(1)   $\hat{\beta}_N = (X^T X)^{-1} X^T y = \left( \sum_{k=1}^K X_k^T X_k \right)^{-1} \sum_{k=1}^K X_k^T X_k \hat{\beta}_k,$

which suggests that we can compute the OLS estimate for the entire data set without accessing the raw data, after saving $(X_k^T X_k, \hat{\beta}_k)$ for each subset. The size of $(X_k^T X_k, \hat{\beta}_k)$ is $p^2 + p$, so we only need to save $Kp(p + 1)$ numbers, which achieves very efficient compression since both $K$ and $p$ are far less than $N$ in practice. The success of this technique owes to the linearity of the estimating equation in the parameter $\beta$: the estimating equation of the entire data set is a simple summation of the equations in all subsets. That is,

$X^T (y - X\beta) = \sum_{k=1}^K X_k^T (y_k - X_k \beta) = 0.$
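To make the aggregation in (1) concrete, the following minimal sketch (in Python with NumPy, our illustration language; the paper's own programs are written in C) compresses each subset into $(X_k^T X_k, \hat{\beta}_k)$, discards the raw data, and recovers the full-data OLS estimate exactly. All variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, K = 100_000, 5, 10
X = rng.normal(size=(N, p))
y = X @ np.arange(1.0, p + 1.0) + rng.normal(size=N)

# Compression: keep only (X_k^T X_k, beta_k) per subset, discard raw data.
compressed = []
for Xk, yk in zip(np.array_split(X, K), np.array_split(y, K)):
    XtX_k = Xk.T @ Xk
    beta_k = np.linalg.solve(XtX_k, Xk.T @ yk)   # subset OLS estimate
    compressed.append((XtX_k, beta_k))

# Aggregation: formula (1) reassembles the full-data OLS estimate.
S = sum(XtX for XtX, _ in compressed)
b = sum(XtX @ beta for XtX, beta in compressed)
beta_agg = np.linalg.solve(S, b)

beta_full = np.linalg.solve(X.T @ X, X.T @ y)    # direct OLS for comparison
assert np.allclose(beta_agg, beta_full)          # identity is exact for OLS
```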
3. THE AEE ESTIMATOR
In this section, we consider, more generally, estimating equation estimation in massive data sets and propose our AEE estimator to provide a computationally tractable estimator by approximation and aggregation.

Given independent observations $\{z_i, i = 1, \ldots, N\}$, suppose that there exists $\beta_0 \in R^p$ such that $\sum_{i=1}^N E[\psi(z_i, \beta_0)] = 0$ for some score function $\psi$. The score function is in general a vector function of the same dimension $p$ as the parameter. The EE estimator $\hat{\beta}_N$ of $\beta_0$ is defined as the solution to the estimating equation $\sum_{i=1}^N \psi(z_i, \beta) = 0$. In regression analysis, we have $z_i = (y_i, x_i^T)$ with response variable $y$ and predictor $x$, and the score function is usually given as $\psi(z, \beta) = \phi(y - x^T \beta) x$ for some function $\phi$. When $\phi$ is the identity function, the estimating equation is linear in $\beta$ and the resulting estimator is the OLS estimator. However, the score function $\psi$ is more often nonlinear, and this nonlinearity makes it difficult to find low-dimensional summary statistics from which the EE estimate for the entire data set can be obtained by aggregation as in (1). Therefore, we adjust our aim to finding an estimator that accurately approximates the EE estimator and can still be computed by aggregation. Our basic idea is to approximate the nonlinear estimating equation by its first-order Taylor approximation, whose linearity then allows us to find representations similar to (1) and hence the proper low-dimensional summary statistics.

Again, consider partitioning the entire data set into $K$ subsets. To simplify our notation, we assume that all subsets are of equal size $n$; this condition is not necessary for our theory, though. Denote the observations in the $k$th subset by $z_{k1}, \ldots, z_{kn}$. The EE estimate $\hat{\beta}_{nk}$ based on the observations in the $k$th subset is then the solution to the following estimating equation,

(2)   $M_k(\beta) = \sum_{i=1}^n \psi(z_{ki}, \beta) = 0.$

Let

(3)   $A_k = -\sum_{i=1}^n \frac{\partial \psi(z_{ki}, \hat{\beta}_{nk})}{\partial \beta}.$

Since $M_k(\hat{\beta}_{nk}) = 0$, we have $M_k(\beta) = -A_k(\beta - \hat{\beta}_{nk}) + R_2 = F_k(\beta) + R_2$ from the Taylor expansion of $M_k(\beta)$ at $\hat{\beta}_{nk}$, where $R_2$ is the residual term of the Taylor expansion. The AEE estimator $\tilde{\beta}_{NK}$ is then the solution to $F(\beta) = \sum_{k=1}^K F_k(\beta) = 0$, which leads to

(4)   $\tilde{\beta}_{NK} = \left( \sum_{k=1}^K A_k \right)^{-1} \sum_{k=1}^K A_k \hat{\beta}_{nk}.$
This representation suggests the following algorithm to compute the AEE estimator.

1. Partition. Partition the entire data set into $K$ subsets, each small enough to fit in the computer's memory.
2. Compression. For the $k$th subset, save $(\hat{\beta}_{nk}, A_k)$ and discard the raw data. Repeat for $k = 1, \ldots, K$.
3. Aggregation. Calculate the AEE estimator $\tilde{\beta}_{NK}$ using (4).

This implementation makes it feasible to process massive data sets on regular computers as long as each partition is manageable for the computer. It also provides a very efficient storage solution because only $K(p^2 + p)$ numbers need to be stored after compressing the data.
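In code, the three steps amount to a short loop once a per-subset root-finder is available. Below is a minimal sketch in Python/NumPy assuming two hypothetical callables, `solve_ee` (any solver for the subset estimating equation (2)) and `neg_jacobian` (returning $A_k$ as in (3)); neither name comes from the paper.

```python
import numpy as np

def aee(subsets, solve_ee, neg_jacobian):
    """Divide-and-conquer AEE estimator, formula (4).

    subsets               : iterable of per-partition data arrays
    solve_ee(Z)           -> beta_hat_nk solving sum_i psi(z_i, beta) = 0
    neg_jacobian(Z, beta) -> A_k = -sum_i d psi(z_i, beta)/d beta  (p x p)
    """
    saved = []
    for Z in subsets:                   # 1. Partition: subsets arrive one at a time
        beta_k = solve_ee(Z)            # 2. Compression: per-subset EE estimate...
        A_k = neg_jacobian(Z, beta_k)   #    ...and first-order derivative A_k
        saved.append((beta_k, A_k))     #    the raw subset Z can now be discarded
    S = sum(A for _, A in saved)        # 3. Aggregation: solve
    b = sum(A @ beta for beta, A in saved)  # (sum_k A_k) beta = sum_k A_k beta_hat_nk
    return np.linalg.solve(S, b)
```

For OLS this reduces exactly to (1), since there $A_k = X_k^T X_k$.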
4. ASYMPTOTIC PROPERTIES
In this section, we give the consistency of the AEE estimator. Theorem 4.1 gives the strong consistency of the AEE estimator for finite $K$. Theorem 4.2 further shows that when $K$ goes to infinity not too fast, the AEE estimator is a consistent estimator under some regularity conditions. Theorem 4.2 is very useful for proving the asymptotic equivalence of the AEE estimator and the EE estimator. In the next section, we apply Theorem 4.2 to the aggregated quasi-likelihood estimator (QLE) and show its asymptotic equivalence to the original QLE. Let the score function be $\psi(z_i, \beta) = (\psi_1(z_i, \beta), \ldots, \psi_p(z_i, \beta))^T$. We first specify some technical conditions.

(C1) The score function $\psi$ is measurable for any fixed $\beta$ and is twice continuously differentiable with respect to $\beta$.

(C2) The matrix $-\partial \psi(z_i, \beta)/\partial \beta$ is semi-positive definite (s.p.d.), and $-\sum_{i=1}^n \partial \psi(z_i, \beta)/\partial \beta$ is positive definite (p.d.) in a neighborhood of $\beta_0$ when $n$ is large enough.

(C3) The EE estimator $\hat{\beta}_n$ is strongly consistent, i.e., $\hat{\beta}_n \to \beta_0$ almost surely (a.s.) as $n \to \infty$.

(C4) There exist two p.d. matrices $\Lambda_1$ and $\Lambda_2$ such that $\Lambda_1 \le n^{-1} A_k \le \Lambda_2$ for all $k = 1, \ldots, K$, i.e., for any $v \in R^p$, $v^T \Lambda_1 v \le n^{-1} v^T A_k v \le v^T \Lambda_2 v$, where $A_k$ is given in (3).

(C5) In a neighborhood of $\beta_0$, the norm of the second-order derivative $\partial^2 \psi_j(z_i, \beta)/\partial \beta^2$ is uniformly bounded, i.e., $\|\partial^2 \psi_j(z_i, \beta)/\partial \beta^2\| \le C_2$ for all $i, j$, where $C_2$ is a constant.

(C6) There exists a real number $\alpha \in (1/4, 1/2)$ such that for any $\eta > 0$, the EE estimator $\hat{\beta}_n$ satisfies $P(n^{\alpha} \|\hat{\beta}_n - \beta_0\| \ge \eta) \le C_\eta n^{2\alpha - 1}$, where $C_\eta > 0$ is a constant depending only on $\eta$.

Under Condition (C2), the matrices $A_k$ are positive definite in probability and therefore the AEE estimator $\tilde{\beta}_{NK}$ is well-defined in probability. Condition (C3) is necessary for the strong consistency of the AEE estimator and is satisfied by almost all EE estimators in practice. Conditions (C4) and (C5) are required to prove the strong consistency of the AEE estimator, and are often true when each subset contains enough observations. Condition (C6) is useful in showing the consistency of the AEE estimator and the asymptotic equivalence of the AEE and EE estimators when the partition number $K$ also goes to infinity as the number of observations goes to infinity. In Section 5, we will show that Condition (C6) is satisfied for the quasi-likelihood estimators considered in [5] under some regularity conditions.
Theorem 4.1. Let $k_0 = \arg\max_{1 \le k \le K} \{\|\hat{\beta}_{nk} - \beta_0\|\}$. Under Conditions (C1)–(C3), if the partition number $K$ is bounded, we have $\|\tilde{\beta}_{NK} - \beta_0\| \le K \|\hat{\beta}_{nk_0} - \beta_0\|$. If Condition (C4) is also true, we have $\|\tilde{\beta}_{NK} - \beta_0\| \le C \|\hat{\beta}_{nk_0} - \beta_0\|$ for some constant $C$ independent of $n$ and $K$. Furthermore, if Condition (C5) is satisfied, we have $\|\tilde{\beta}_{NK} - \hat{\beta}_N\| \le C_1 (\|\hat{\beta}_{nk_0} - \beta_0\|^2 + \|\hat{\beta}_N - \beta_0\|^2)$ for some constant $C_1$ independent of $n$ and $K$.
Theorem 4.1 shows that if the partition number $K$ is bounded, then the AEE estimator is also strongly consistent. Usually, we have $\|\hat{\beta}_N - \beta_0\| = o(\|\hat{\beta}_{nk_0} - \beta_0\|)$. Therefore, the last part of Theorem 4.1, combined with the triangle inequality, implies that $\|\tilde{\beta}_{NK} - \beta_0\| \le 2C_1 \|\hat{\beta}_{nk_0} - \beta_0\|^2 + \|\hat{\beta}_N - \beta_0\|$.
Theorem 4.2. Let $\hat{\beta}_N$ be the EE estimator based on the entire data set. Then under Conditions (C1)–(C2) and (C4)–(C6), if the partition number $K$ satisfies $K = O(n^{\gamma})$ for some $0 < \gamma < \min\{1 - 2\alpha, 4\alpha - 1\}$, we have $P(\sqrt{N} \|\tilde{\beta}_{NK} - \hat{\beta}_N\| > \delta) = o(1)$ for any $\delta > 0$.
Theorem 4.2 tells us that if the EE estimator $\hat{\beta}_N$ is a consistent estimator and the partition number $K$ goes to infinity slowly enough, then the AEE estimator $\tilde{\beta}_{NK}$ is also a consistent estimator. In general, one can easily use Theorem 4.2 to show the asymptotic normality of the AEE estimator if the EE estimator is asymptotically normally distributed, and further to prove the asymptotic equivalence of the two estimators.
5. THE AGGREGATED QLE
In this section, we demonstrate the applicability of the AEE technique to quasi-likelihood estimation and call the resulting estimator the aggregated quasi-likelihood estimator (AQLE). We consider a simplified version of the QLE discussed in [5]. Suppose that we have $N$ independent observations $(y_i, x_i)$, $i = 1, \ldots, N$, where $y$ is a scalar response and $x$ is a $p$-dimensional vector of explanatory variables. Let $\mu$ be a continuously differentiable function such that $\dot{\mu}(t) = d\mu/dt > 0$ for all $t$. Suppose that we have

(5)   $E(y_i) = \mu(\beta_0^T x_i), \quad i = 1, \ldots, N,$

for some $\beta_0 \in R^p$. Then the QLE of $\beta_0$, $\hat{\beta}_N$, is the solution to the estimating equation

(6)   $Q(\beta) = \sum_{i=1}^N [y_i - \mu(\beta^T x_i)] x_i = 0.$
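For the score in (6), $\psi((y, x), \beta) = [y - \mu(\beta^T x)] x$ and $A_k = \sum_{i=1}^n \dot{\mu}(\hat{\beta}_{nk}^T x_{ki}) x_{ki} x_{ki}^T$, so a single Newton–Raphson pass per subset yields both statistics saved in the compression step. The sketch below is our illustration in Python/NumPy, not the authors' implementation; the step-halving safeguard is our addition.

```python
import numpy as np

def qle_newton(X, y, mu, mu_dot, tol=1e-10, max_iter=100):
    """Solve Q(beta) = sum_i [y_i - mu(beta^T x_i)] x_i = 0 by Newton-Raphson.

    Returns (beta_hat, A) with A = sum_i mu_dot(beta_hat^T x_i) x_i x_i^T,
    exactly the pair (beta_hat_nk, A_k) saved in the compression step.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        score = X.T @ (y - mu(X @ beta))            # Q(beta)
        A = (X * mu_dot(X @ beta)[:, None]).T @ X   # -dQ/dbeta, positive definite
        step = np.linalg.solve(A, score)
        for _ in range(30):                         # step-halving safeguard
            q_new = X.T @ (y - mu(X @ (beta + step)))
            if np.all(np.isfinite(q_new)) and \
               np.linalg.norm(q_new) <= np.linalg.norm(score):
                break
            step /= 2.0
        beta = beta + step
        if np.linalg.norm(step) < tol:
            break
    A = (X * mu_dot(X @ beta)[:, None]).T @ X       # A_k evaluated at beta_hat
    return beta, A
```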
Let $\varepsilon_i = y_i - \mu(\beta_0^T x_i)$ and $\sigma_i^2 = \mathrm{Var}(y_i)$. The following theorem shows that Condition (C6) is satisfied for the QLE under some regularity conditions.
Theorem 5.1. Consider a generalized linear model specified by (5) with fixed design. Suppose that the $y_i$'s are independent and that $\lambda_N$ is the minimum eigenvalue of $\sum_{i=1}^N x_i x_i^T$. If there are two positive constants $C$ and $M$ such that $\lambda_N / N > C$ and $\sup_i \{\|x_i\|, \sigma_i^2\} \le M$, then for any $\eta > 0$ and $\alpha \in (0, 1/2)$,

$P(N^{\alpha} \|\hat{\beta}_N - \beta_0\| \ge \eta) \le C_1 (m_\eta \eta)^{-2} N^{2\alpha - 1},$

where $C_1 = pM^3 C^{-3}$ is a constant and $m_\eta > 0$ is a constant depending only on $\eta$.
Now suppose that the entire data set is partitioned into $K$ subsets. Let $\{(y_{ki}, x_{ki})\}_{i=1}^n$ be the observations in the $k$th subset with $n = N/K$.

(B1) The link function $\mu$ is twice continuously differentiable and the derivative of the link function is always positive, i.e., $\dot{\mu}(t) > 0$.

(B2) The vectors $x_{ki}$ are fixed and uniformly bounded, and the minimum eigenvalue $\lambda_k$ of $\sum_{j=1}^n x_{kj} x_{kj}^T$ satisfies $\lambda_k / n > C > 0$ for all $k$ and $n$.

(B3) The variances $\sigma_{ki}^2$ of $y_{ki}$ are uniformly bounded.
Condition (B1) is needed for Conditions (C1) and (C5). Conditions (B1)–(B2) together guarantee Conditions (C2), (C4) and (C5). It is easy to verify that all the conditions assumed in Theorem 1 of [5] are satisfied under Conditions (B1)–(B2). Hence, by Theorem 1 in [5], the QLEs $\hat{\beta}_{nk}$ are strongly consistent. Theorem 5.1 implies that the QLEs $\hat{\beta}_{nk}$ satisfy Condition (C6) under Conditions (B1)–(B3). Therefore, the conclusions in Theorems 4.1 and 4.2 hold for the AQLE under Conditions (B1)–(B3). Furthermore, the AQLE $\tilde{\beta}_{NK}$ has the following asymptotic normality.
Theorem 5.2. Let $\Sigma_N = \sum_{i=1}^N \sigma_i^2 x_i x_i^T$ and $D_N(\beta) = \sum_{i=1}^N \dot{\mu}(x_i^T \beta) x_i x_i^T$. Suppose that there exists a constant $c_1$ such that $\sigma_i^2 > c_1^2$ for all $i$ and that $\sup_i E(|\varepsilon_i|^r) < \infty$ for some $r > 2$. Then under Conditions (B1)–(B3), if $K = O(n^{\gamma})$ for some $0 < \gamma < \min\{1 - 2\alpha, 4\alpha - 1\}$, we have $\Sigma_N^{-1/2} D_N(\beta_0) (\tilde{\beta}_{NK} - \beta_0) \overset{d}{\longrightarrow} N(0, I_p)$, and $\tilde{\beta}_{NK}$ is asymptotically equivalent to the QLE $\hat{\beta}_N$.
6. SIMULATION STUDIES AND REAL DATA ANALYSIS
6.1 Simulation
In this section, we illustrate the computational advantages of the AEE estimator by simulation studies. We consider computing the maximum likelihood estimator (MLE) of the regression coefficients in logistic regression with five predictors $x_1, \ldots, x_5$. Let $y_i$ be the binary response and $x_i = (1, x_{i1}, \ldots, x_{i5})^T$. In a logistic regression model, we have

$\Pr(y_i = 1) = \mu(x_i^T \beta) = \frac{\exp(x_i^T \beta)}{1 + \exp(x_i^T \beta)}, \quad i = 1, \ldots, N,$

and the MLE of the regression coefficients $\beta$ is a special case of the QLE discussed in Section 5. We set the true regression coefficients as $\beta = (\beta_0, \beta_1, \ldots, \beta_5) = (1, 2, 3, 4, 5, 6)$ and the sample size as $N$ = 500,000. The predictor values are drawn independently from the standard normal distribution.

We then compute $\tilde{\beta}_{NK}$, the AEE estimate of $\beta$, for partition numbers $K$ = 1,000, 950, ..., 100, 90, ..., 10. In compressing the subsets, we use the Newton–Raphson method to calculate the MLE $\hat{\beta}_{nk}$ in every subset $k$, $k = 1, \ldots, K$. For comparison, we also compute $\hat{\beta}_N$, the MLE from the entire data set, which is equivalent to $\tilde{\beta}_{NK}$ when $K = 1$. All programs are written in C and our computer has a 1.6 GHz Pentium processor and 512 MB memory.
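This experiment is easy to reproduce in outline. The sketch below (in Python/NumPy rather than the C used for the paper's timings) reuses the hypothetical `qle_newton` solver sketched in Section 5 with the logistic mean function, whose derivative is $\dot{\mu}(t) = \mu(t)(1 - \mu(t))$; $K = 100$ is one illustrative value from the range above.

```python
import numpy as np

def mu(t):          # logistic mean function
    return 1.0 / (1.0 + np.exp(-t))

def mu_dot(t):      # its derivative, the binomial variance function
    m = mu(t)
    return m * (1.0 - m)

rng = np.random.default_rng(1)
N, K = 500_000, 100
beta_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X = np.column_stack([np.ones(N), rng.normal(size=(N, 5))])
y = rng.binomial(1, mu(X @ beta_true))

# Compression: Newton-Raphson MLE and A_k in each of the K subsets.
saved = [qle_newton(Xk, yk, mu, mu_dot)
         for Xk, yk in zip(np.array_split(X, K), np.array_split(y, K))]

# Aggregation by formula (4), then the relative bias plotted in Figure 1.
S = sum(A for _, A in saved)
beta_aee = np.linalg.solve(S, sum(A @ b for b, A in saved))
print(np.linalg.norm(beta_aee - beta_true) / np.linalg.norm(beta_true))
```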
Figure 1 plots the relative bias $\|\tilde{\beta}_{NK} - \beta_0\| / \|\beta_0\|$ against the number of partitions $K$.

Figure 1. Relative bias against number of partitions.

The linearly increasing trend can be well explained by our theory. In Section 4, we argued that the magnitude of $\|\tilde{\beta}_{NK} - \beta_0\|$ is close to $2C_1 \|\hat{\beta}_{nk_0} - \beta_0\|^2 + \|\hat{\beta}_N - \beta_0\|$. From Theorem 1 in [5], we have $\|\hat{\beta}_{nk_0} - \beta_0\|^2 = o([\log n]^{1+\delta} / n)$. Since $\log n \ll n$, $\|\tilde{\beta}_{NK} - \beta_0\|$ is close to $o(1/n) = o(K/N)$, which increases linearly with $K$ when $N$ is held fixed. Since $N$ is fixed, $\|\hat{\beta}_N - \beta_0\|$ is fixed, and so $\|\tilde{\beta}_{NK} - \beta_0\|$ will roughly increase linearly with $K$.
Figure 2 plots the computation time against the number of partitions. It takes 290 seconds to compute the MLE ($K = 1$) and 128 seconds to compute the AEE estimator when $K = 10$, a reduction in computation time of more than 50%. As $K$ increases, the computation time soon stabilizes. This shows that we may choose a relatively small $K$ as long as the size of each subset does not exceed the storage limit or memory constraint. On the other hand, we see that the AEE estimator provides not only an efficient storage solution, but also a viable way to achieve more efficient computation even when the EE estimate using all the raw data can be computed.
Next, we show that the AEE estimator is more accurate than estimates based on sub-sampling. In our study, we can view each $\hat{\beta}_{nk}$ as an estimate based on a sub-sample of the entire data set. Table 1 presents the percentages of the $\hat{\beta}_{nk}$'s whose relative bias $\|\hat{\beta}_{nk} - \beta_0\| / \|\beta_0\|$ is above that of the AEE estimator for different partition numbers. More than 90% of the $\hat{\beta}_{nk}$'s have a relative bias larger than that of $\tilde{\beta}_{NK}$, which clearly shows that the AEE estimator is more accurate than estimators based on sub-sampling.

Figure 2. Computation time against number of partitions K.

Table 1. Performance of $\hat{\beta}_{nk}$

K            500    100    50     10
Percentage   94%    97%    94%    90%
6.2 Real data analysis
In this section, we apply our aggregation technique to a real data set. In [8], Chiang et al. used next-generation sequencing (NGS) to detect copy number variation in the sample genome. It is known that current NGS platforms have various biases [9]; for example, GC-bias can lead to an uneven distribution of short reads on the genome. Another important factor that can influence the read distribution is the mappability [23] of genomic positions. Specifically, due to the existence of segmental duplications and repeat sequences, a short sequence (e.g., a 35 bp short sequence) starting from a genomic position may have many copies in the reference genome, making this genomic position not uniquely mappable. Hence, variation of the mappability across the reference genome will also lead to an uneven distribution of uniquely mapped reads. Here, we are interested in how the number of reads in a certain genomic window is related to factors like GC-content and mappability.
We use the sequencing data of the matched normal genome of the cell line H2347 in [8] to study how the number of reads relates to other factors. We first binned the uniquely mapped reads into 1,000 bp bins and counted the number of reads in each bin. Then, for each bin, we counted how many nucleotides are G, C and A; since the bin size is known, the G, C and A counts also determine how many nucleotides are T. For each bin, we also counted how many genomic positions are uniquely mappable (35 bp short sequence). Assume that the number of reads in the $i$th bin follows a Poisson distribution with parameter $\lambda_i$. We consider the following model,

$\log(\lambda_i) = \beta_0 + \beta_1 \log(G) + \beta_2 \log(C) + \beta_3 \log(A) + \beta_4 \log(M),$

where $G$, $C$ and $A$ are the G, C and A counts in the $i$th bin, and $M$ is the proportion of uniquely mappable positions. To avoid taking the logarithm of zero, we added a small number to the G, C and A counts (0.1) and to the mappability (0.0001).
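To illustrate this preprocessing and the fit, here is a sketch on simulated bins; the Poisson log-linear model has $\mu(t) = \dot{\mu}(t) = e^t$, so the hypothetical `qle_newton` solver from Section 5 applies unchanged. The counts, coefficients and partition number below are toy stand-ins, not values from the study.

```python
import numpy as np

rng = np.random.default_rng(2)
bins = 50_000                                      # simulated 1,000 bp bins
g, c, a = rng.integers(150, 350, size=(3, bins))   # toy G, C, A counts per bin
mappable = rng.uniform(0.0, 1.0, size=bins)        # toy mappability proportions

# Pseudocounts (0.1 on nucleotide counts, 0.0001 on mappability) avoid log(0).
X = np.column_stack([np.ones(bins),
                     np.log(g + 0.1), np.log(c + 0.1), np.log(a + 0.1),
                     np.log(mappable + 1e-4)])
beta_toy = np.array([1.0, 0.3, 0.3, 0.1, 0.5])     # arbitrary toy coefficients
y = rng.poisson(np.exp(X @ beta_toy))              # Poisson read counts per bin

# Subsets of 5,000 bins each; mu(t) = mu_dot(t) = exp(t) for Poisson regression.
saved = [qle_newton(Xk, yk, np.exp, np.exp)
         for Xk, yk in zip(np.array_split(X, 10), np.array_split(y, 10))]
S = sum(A for _, A in saved)
beta_aee = np.linalg.solve(S, sum(A @ b for b, A in saved))
```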
Then, for each chromosome (chromosomes 1, ..., 22 and X), we compared the MLE $\hat{\beta}$ with its corresponding AEE estimate $\tilde{\beta}$ of the Poisson regression model. To calculate the AEE estimate, for each chromosome, we partitioned the data set into $K$ subsets such that each subset had 5,000 data points (except possibly one subset). Figure 3 shows the number of subsets $K$ used for each chromosome. Then, for each chromosome, we calculated the relative bias $\|\tilde{\beta} - \hat{\beta}\| / \|\hat{\beta}\|$ (Figure 4). From Figure 4, we see that the MLE and its corresponding AEE estimates are very close, showing that our aggregation performs well on this data set.

Figure 3. The number of subsets K used for each chromosome.
7. APPLICATIONS: DATA CUBES AND DATA STREAMS
In this section, we discuss applications of the AEE estimator in two massive data environments: data cubes and data streams. Analyses in both environments require performing the same analysis for different subsets while the raw data often cannot be saved permanently. Efficient compression

References

Aggarwal, C. C., Han, J., Wang, J. and Yu, P. S. (2003). A framework for clustering evolving data streams. In Proceedings of the 29th International Conference on Very Large Data Bases (VLDB).

Gray, J., Chaudhuri, S., Bosworth, A., Layman, A., Reichart, D., Venkatrao, M., Pellow, F. and Pirahesh, H. (1997). Data cube: a relational aggregation operator generalizing GROUP-BY, CROSS-TAB, and SUB-TOTALS. Data Mining and Knowledge Discovery 1 29–53.

Rao, C. R. (1973). Linear Statistical Inference and Its Applications, 2nd ed. Wiley, New York.

Wedderburn, R. W. M. (1974). Quasi-likelihood functions, generalized linear models, and the Gauss–Newton method. Biometrika 61 439–447.