Learning Generalized Linear Models Over Normalized Data
Arun Kumar Jeffrey Naughton Jignesh M. Patel
Department of Computer Sciences,
University of Wisconsin-Madison
{arun, naughton, jignesh}@cs.wisc.edu
ABSTRACT
Enterprise data analytics is a booming area in the data man-
agement industry. Many companies are racing to develop
toolkits that closely integrate statistical and machine learn-
ing techniques with data management systems. Almost all
such toolkits assume that the input to a learning algorithm
is a single table. However, most relational datasets are not
stored as single tables due to normalization. Thus, analysts
often perform key-foreign key joins before learning on the
join output. This strategy of learning after joins introduces
redundancy avoided by normalization, which could lead to
poorer end-to-end performance and maintenance overheads
due to data duplication. In this work, we take a step towards
enabling and optimizing learning over joins for a common
class of machine learning techniques called generalized linear
models that are solved using gradient descent algorithms in
an RDBMS setting. We present alternative approaches to
learn over a join that are easy to implement over existing
RDBMSs. We introduce a new approach named factorized
learning that pushes ML computations through joins and
avoids redundancy in both I/O and computations. We study
the tradeoff space for all our approaches both analytically
and empirically. Our results show that factorized learning
is often substantially faster than the alternatives, but is not
always the fastest, necessitating a cost-based approach. We
also discuss extensions of all our approaches to multi-table
joins as well as to Hive.
Categories and Subject Descriptors
H.2 [Information Systems]: Database Management
Keywords
Analytics; feature engineering; joins; machine learning
1. INTRODUCTION
There is an escalating arms race to bring sophisticated sta-
tistical and machine learning (ML) techniques to enterprise
applications [3, 5]. A number of projects in both industry
and academia aim to integrate ML capabilities with data
processing in RDBMSs, Hadoop, and other systems [2, 4, 9,
15,18,21,22,33,34]. Almost all such implementations of ML
algorithms require that the input dataset be a single table.
However, most relational datasets are not stored as single
tables due to normalization [27]. Thus, analysts often per-
form key-foreign key joins of the base tables and materialize
a single temporary table that is used as the input to the ML
algorithm, i.e., they learn after joins.
Example: Consider an insurance company analyst modeling customer churn (will a customer leave the company or not), a standard classification task. She builds a logistic regression model using the large table that stores customer details: Customers(CustomerID, Churn, Age, Income, . . . , EmployerID). Note that one of the features, EmployerID, is the ID of the customer's employer. It is a foreign key that refers to a separate table that stores details about companies and other organizations: Employers(EmployerID, Revenue, NumEmployees, . . . ). She joins the two tables on the
EmployerID as part of her “feature engineering” because she
thinks the features of the employer might be helpful in pre-
dicting how likely a customer is to churn. For example, she
might have a hunch that customers employed by large cor-
porations are less likely to churn. She writes the output of
the join as a single temporary table and feeds it to an ML
toolkit that implements logistic regression.
Similar examples arise in a number of other application
domains, e.g., detecting malicious users by joining data about
user accounts with account activities, predicting census mail
response rates by joining data about census districts with
individual households, recommending products by joining
data about past ratings with users and products, etc.
Learning after joins imposes an artificial barrier between
the ML-based analysis and the base relations, resulting in
several practical issues. First, the table obtained after the
join can be much larger than the base tables themselves
because the join introduces redundancy that was originally
removed by database normalization [8, 27]. This results in
unnecessary overheads for storage and performance, as well as wasted time performing extra computations on redundant data. Second, as the base tables evolve, maintaining
the materialized output of the join could become an over-
head. Finally, analysts often perform exploratory analysis
of different subsets of features and data [20, 33]. Materializ-
ing temporary tables after joins for learning on each subset
could slow the analyst and inhibit exploration [7]. Learning
over joins, i.e., pushing ML computations through joins to
the base tables, mitigates such drawbacks.

Figure 1: Learning over a join: (A) Schema and logical workflow. Feature vectors from S (e.g., Customers)
and R (e.g., Employers) are concatenated and used for BGD. The loss (F) and gradient (∇F) for BGD
can be computed together during a pass over the data. Approaches compared: Materialize (M), Stream
(S), Stream-Reuse (SR), and Factorized Learning (FL). High-level qualitative comparison of storage-runtime
tradeoffs and CPU-I/O cost tradeoffs for runtimes of the four approaches (S is assumed to be larger than R,
and the plots are not to scale) (B) When the hash table on R does not fit in buffer memory, S, SR, and M
require extra storage space for temporary tables or partitions. But, SR could be faster than FL due to lower
I/O costs. (C) When the hash table on R fits in buffer memory, but S does not, SR becomes similar to S
and neither need extra storage space, but both could be slower than FL. (D) When all data fit comfortably
in buffer memory, none of the approaches need extra storage space, and M could be faster than FL.
From a technical perspective, the issues that arise from
the redundancy present in a denormalized relation (used
for learning after joins) are well known in the context of
traditional relational data management [27]. But the im-
plications of this type of redundancy in the context of ML
algorithms are much less well understood. Thus, an im-
portant challenge to be addressed is whether it is possible to de-
vise approaches that learn over joins and avoid introducing
such redundancy without sacrificing either the model qual-
ity, learning efficiency, or scalability compared to the cur-
rently standard approach of learning after joins.
As a first step, in this paper, we show that, for a large
generic class of ML techniques called Generalized Linear
Models (GLMs), it is possible to learn over joins and avoid
redundancy without sacrificing quality and scalability, while
actually improving performance. Furthermore, all our ap-
proaches to learn GLMs over joins are simple and easy to im-
plement using existing RDBMS abstractions, which makes
them more easily deployable than approaches that require
deep changes to the code of an RDBMS. We focus on GLMs
because they include many popular classification and regres-
sion techniques [17, 24]. We use standard gradient methods
to learn GLMs: Batch Gradient Descent (BGD), Conjugate
Gradient (CGD), and (L)BFGS [26]. For clarity of exposi-
tion, we use only BGD, but our results are also applicable
to these other gradient methods. BGD is a numerical opti-
mization algorithm that minimizes an objective function by
performing multiple passes (iterations) over the data.
Figure 1(A) gives a high-level overview of our problem.
We call the approach of materializing T before BGD as Ma-
terialize. We focus on the hybrid hash algorithm for the join
operation [31]. We assume that R is smaller in size than S
and estimate the I/O and CPU costs of all our approaches in
a manner similar to [31]. We propose three alternative ap-
proaches to run BGD over a join in a single-node RDBMS
setting: Stream, Stream-Reuse, and Factorized Learning.
Each approach avoids some forms of redundancy. Stream
avoids writing T and could save on I/O. Stream-Reuse also
exploits the fact that BGD is iterative and avoids reparti-
tioning of the base relations after the first iteration. But,
neither approach avoids redundancy in the computations for
BGD. Thus, we design the Factorized Learning (in short,
Factorize) approach that avoids computational redundancy
as well. Factorize achieves this by interleaving the compu-
tations and I/O of the join operation and BGD. None of
our approaches compromise on model quality. Furthermore,
they are all easy to implement in an RDBMS using the
abstraction of user-defined aggregate functions (UDAFs),
which provides scalability and ease of deployment [13, 16].
The performance picture, however, is more complex. Fig-
ures 1(B-D) give a high-level qualitative overview of the
tradeoff space for all our approaches in terms of the stor-
age space needed and the runtimes (split into I/O and CPU
costs). Both our analytical and experimental results show
that Factorize is often the fastest approach, but which ap-
proach is the fastest depends on a combination of factors
such as buffer memory, input table dimensions, and number
of iterations. Thus, a cost model such as ours is required to
select the fastest approach for a given instance of our prob-
lem. Furthermore, we identify that Factorize might face a
scalability bottleneck since it maintains an aggregation state
whose size is linear in the number of tuples in R. We pro-
pose three extensions to mitigate this bottleneck and find
that none of them dominate the others in terms of runtime,
which again necessitates our cost model.
We extend all our approaches to multi-table joins, specif-
ically, the case in which S has multiple foreign keys. Such
a scenario arises in applications such as recommendation
systems in which a table of ratings refers to both the user
and product tables [28]. We show that optimally extend-
ing Factorize to multi-table joins involves solving a problem
that is NP-Hard. We propose a simple, but effective, greedy
heuristic to tackle this problem. Finally, we extend all our
approaches to the shared-nothing parallel setting and im-
plement them on Hive. We find near-linear speedups and
scaleups for all our approaches.
In summary, our work makes the following contributions:
• To the best of our knowledge, this is the first paper to
study the problem of learning over joins of large rela-
tions without materializing the join output. Focusing
on GLMs solved using BGD, we explain the tradeoff
space in terms of I/O and CPU costs and propose al-
ternative approaches to learn over joins.
• We propose the Factorize approach that pushes BGD
computations through a join, while being amenable to
a simple implementation in existing RDBMSs.

ML Technique                                        F_e(a, b) (For Loss)     G(a, b) (For Gradient)
Logistic Regression (LR)                            log(1 + e^(−ab))         −a / (1 + e^(ab))
Least-Squares Regression (LSR), Lasso, and Ridge    (a − b)^2                2(b − a)
Linear Support Vector Machine (LSVM)                max{0, 1 − ab}           −a·δ_(ab<1)

Table 1: GLMs and their functions.
• We compare the performance of all our approaches
both analytically and empirically using implementa-
tions on PostgreSQL. Our results show that Factorize
is often, but not always, the fastest approach. A com-
bination of factors such as the buffer memory, the di-
mensions of the input tables, and the number of BGD
iterations determines which approach is the fastest.
We also validate the accuracy of our analytical models.
• We extend all our approaches to multi-table joins. We
also demonstrate how to parallelize them using imple-
mentations on Hive.
Outline. In Section 2, we present a brief background on
GLMs and BGD and some preliminaries for our problem. In
Section 3, we explain our cost model and simple approaches
to learn over joins. In Section 4, we present the new ap-
proach of Factorized Learning and its extensions. In Section
5, we discuss our experimental setup and results. We discuss
related work in Section 6 and conclude in Section 7.
2. BACKGROUND AND PRELIMINARIES
We provide a brief introduction to GLMs and BGD. For
a deeper description, we refer the reader to [17, 24, 26].
Generalized Linear Models (GLMs). Consider a dataset of n examples, each of which includes a d-dimensional numeric feature vector, x_i, and a numeric target, y_i (i = 1 to n). For regression, y_i ∈ ℝ, while for (binary) classification, y_i ∈ {−1, 1}. Loosely, GLMs assume that the data points can be separated into their target classes (for classification), or approximated (for regression), by a hyperplane. The idea is to compute such a hyperplane w ∈ ℝ^d by defining an optimization problem using the given dataset.
We are given a linearly separable objective function that computes the loss of a given model w ∈ ℝ^d on the data: F(w) = Σ_{i=1}^{n} F_e(y_i, w^T x_i). The goal of an ML algorithm is to minimize the loss function, i.e., find a vector w* ∈ ℝ^d such that w* = arg min_w F(w). Table 1 lists examples of some popular GLM techniques and their respective loss functions. The loss functions of GLMs are convex (bowl-shaped), which means any local minimum is a global minimum, and standard gradient descent algorithms can be used to solve them.¹
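To make the notation concrete, here is a small NumPy sketch (ours, not code from the paper) of the per-example functions F_e(a, b) and G(a, b) from Table 1, with a = y_i and b = w^T x_i, together with the summed loss F(w) and gradient ∇F(w); the GLMS dictionary and the loss_and_gradient helper are hypothetical names.

```python
import numpy as np

# Per-example loss F_e(a, b) and gradient scalar G(a, b) from Table 1,
# where a = y_i and b = w^T x_i.
GLMS = {
    "LR":   (lambda a, b: np.log1p(np.exp(-a * b)),      # logistic regression
             lambda a, b: -a / (1.0 + np.exp(a * b))),
    "LSR":  (lambda a, b: (a - b) ** 2,                   # least-squares regression
             lambda a, b: 2.0 * (b - a)),
    "LSVM": (lambda a, b: np.maximum(0.0, 1.0 - a * b),   # linear SVM (hinge loss)
             lambda a, b: -a * (a * b < 1.0)),
}

def loss_and_gradient(w, X, y, model="LR"):
    """F(w) = sum_i F_e(y_i, w^T x_i) and grad F(w) = sum_i G(y_i, w^T x_i) x_i."""
    f_e, g = GLMS[model]
    b = X @ w                    # inner products w^T x_i for all i
    F = f_e(y, b).sum()
    gradF = X.T @ g(y, b)        # sum_i G(y_i, b_i) * x_i
    return F, gradF
```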
Batch Gradient Descent (BGD). BGD is a simple algorithm to solve GLMs using iterative numerical optimization. BGD initializes the model w to some w_0, computes the gradient ∇F(w) on the given dataset, and updates the model as w ← w − α∇F(w), where α > 0 is the stepsize parameter. The method is outlined in Algorithm 1. Like F, the gradient is also linearly separable: ∇F(w) = Σ_{i=1}^{n} G(y_i, w^T x_i) x_i.
¹Typically, a convex penalty term called a regularizer is added to the loss to constrain ‖w‖ [17].
Algorithm 1 Batch Gradient Descent (BGD)
Inputs: {x_i, y_i}_{i=1}^{n} (Data), w_0 (Initial model)
1: k ← 0, r_prev ← null, r_curr ← null, g_k ← null
2: while (Stop(k, r_prev, r_curr, g_k) = False) do
3:    r_prev ← r_curr
4:    (g_k, r_curr) ← (∇F_{k+1}, F_{k+1})    ▷ 1 pass over data
5:    w_{k+1} ← w_k − α_k g_k    ▷ Pick α_k by line search
6:    k ← k + 1
7: end while
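A minimal Python rendering of Algorithm 1 (our sketch; it uses a fixed stepsize instead of the line search, a simple loss-based stopping rule, and function names of our own choosing):

```python
import numpy as np

def bgd(grad_and_loss, w0, alpha=0.1, max_iters=100, tol=1e-6):
    """Algorithm 1: grad_and_loss(w) returns (grad F(w), F(w)) in one pass over the data."""
    w, r_prev = w0.copy(), None
    for k in range(max_iters):
        g_k, r_curr = grad_and_loss(w)            # step 4: one pass over the data
        if r_prev is not None and abs(r_prev - r_curr) < tol:
            break                                 # Stop(): loss has converged
        w, r_prev = w - alpha * g_k, r_curr       # step 5, with a fixed stepsize alpha
    return w

# Example usage on a tiny least-squares problem (LSR from Table 1):
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
lsr = lambda w: ((2.0 * (X @ w - y)) @ X, ((y - X @ w) ** 2).sum())
w_star = bgd(lsr, np.zeros(2))
```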
Since the gradient is the direction of steepest ascent of F, BGD is also known as the method of steepest descent [26]. Table 1 also lists the gradient functions of the GLMs. We shall use ∇F and ∇F(w) interchangeably.
BGD updates the model repeatedly, i.e., over many itera-
tions (or epochs), each of which requires (at least) one pass
over the data. The loss value typically drops over itera-
tions. The algorithm is typically stopped after a pre-defined
number of iterations, or when it converges (e.g., the drop in
the loss value across iterations, or the norm of the gradient,
falls below a given threshold). The stepsize parameter (α) is
typically tuned using a line search method that potentially
computes the loss many times (similar to step 4) [26].
On large data, it is likely that computing F and ∇F dominates the runtime of BGD [12, 13]. Fortunately, both F and ∇F can be computed scalably in a manner similar
to distributive aggregates like SUM in SQL. Thus, it is easy
to implement BGD using the abstraction of a user-defined
aggregate function (UDAF) that is available in almost all
RDBMSs [13, 16]. However, unlike SUM, BGD performs a
“multi-column” or vector aggregation since all feature values
of an example are needed to compute its contribution to the
gradient. For simplicity of exposition, we assume that fea-
ture vectors are instead stored as arrays in a single column.
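As an illustration of that abstraction (our sketch, not the paper's actual UDAF code), the Python class below loosely mirrors the initialize/transition/merge/finalize contract of UDAFs in RDBMSs such as PostgreSQL, accumulating (F, ∇F) for logistic regression while treating each tuple's feature vector as a single array-valued column:

```python
import numpy as np

class BGDLossGradAgg:
    """UDAF-style aggregate that accumulates (F, grad F) for LR in one pass."""
    def __init__(self, w):
        self.w = np.asarray(w)           # current model, fixed during this pass
        self.F = 0.0
        self.gradF = np.zeros_like(self.w)

    def transition(self, y, x):
        """Consume one tuple (y, x); x is the feature array stored in one column."""
        b = float(np.dot(self.w, x))     # inner product w^T x
        self.F += np.log1p(np.exp(-y * b))
        self.gradF += (-y / (1.0 + np.exp(y * b))) * np.asarray(x)

    def merge(self, other):
        """Combine partial states (e.g., from parallel segments), like SUM."""
        self.F += other.F
        self.gradF += other.gradF
        return self

    def finalize(self):
        return self.F, self.gradF

# One pass over a toy dataset:
agg = BGDLossGradAgg(w=np.zeros(3))
for y, x in [(1.0, [1.0, 0.5, -1.0]), (-1.0, [0.2, -0.3, 0.8])]:
    agg.transition(y, x)
F, gradF = agg.finalize()
```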
Joins Before Learning. From our conversations with analysts at companies across various domains (insurance, consulting, Web search, security, and e-commerce), we have learned that analysts often perform joins to replace foreign key references with actual feature values as part of their feature engineering effort.² In this work, we focus chiefly on a two-table join. We term the main table with the entities to learn on as the entity table (denoted S). We term the other table as the attribute table (denoted R). A column in S is a foreign key that refers to R.
Problem Statement. Suppose there are n_S examples (tuples) in S, and n_R tuples in R. Assume that the feature vectors are split across S and R, with d_S − 1 features in X_S and d_R = d − d_S + 1 in X_R. Thus, the "width" of S is 2 + d_S, including the ID, foreign key, and target. The width of R is 1 + d_R, including the ID. Typically, we have n_S ≫ n_R, similar to how fact tables have more tuples than dimension tables in OLAP [16, 27]. We now state our problem formally (illustrated in Figure 1(A)).

Given two relations S(SID, Y, X_S, FK) and R(RID, X_R) with a key-foreign key relationship (S.FK refers to R.RID), where X_S and X_R are feature vectors and Y is the target, learn a GLM using BGD over the result of the projected equi-join T(SID, Y, [X_S X_R]) = π(R ⋈_{RID=FK} S) such that the feature vector of a tuple in T is the concatenation of the feature vectors from the joining tuples of S and R.

²An alternative is to simply ignore the foreign key, or treat it as a large, sparse categorical feature. Such feature engineering judgements are largely analyst-specific [7, 20, 33]. Our work simply aims to make feature engineering easier.

Symbol   Meaning
R        Attribute table
S        Entity table
T        Join result table
n_R      Number of rows in R
n_S      Number of rows in S
d_R      Number of features in R
d_S      Number of features in S (includes Y)
p        Page size in bytes (1MB used)
m        Allocated buffer memory (pages)
f        Hash table fudge factor (1.4 used)
|R|      Number of R pages = ⌈8·n_R·(1 + d_R) / p⌉
|S|      Number of S pages = ⌈8·n_S·(2 + d_S) / p⌉
|T|      Number of T pages = ⌈8·n_S·(1 + d_S + d_R) / p⌉
Iters    Number of iterations of BGD (≥ 1)

Table 2: Notation for objects and parameters used in the cost models. I/O costs are counted in number of pages. Dividing by the disk throughput yields the estimated runtimes. NB: As a simplifying assumption, we use an 8B representation for all values: IDs, target, and features (categorical features are assumed to have been converted to numeric ones [17]).
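To make the schema and the projected equi-join concrete, here is a small pandas sketch (ours; the toy rows and the convention of storing a feature vector as a list in one column are purely illustrative) that materializes T the way the learning-after-joins strategy would:

```python
import pandas as pd

# Entity table S(SID, Y, XS, FK) and attribute table R(RID, XR), per Figure 1(A);
# feature vectors are kept as lists in a single column for simplicity.
S = pd.DataFrame({
    "SID": [1, 2, 3],
    "Y":   [1, -1, 1],
    "XS":  [[0.1, 2.0], [1.5, 0.3], [0.7, 0.7]],
    "FK":  [10, 20, 10],
})
R = pd.DataFrame({
    "RID": [10, 20],
    "XR":  [[5.0, 1.0, 0.0], [2.0, 0.0, 1.0]],
})

# T(SID, Y, [XS XR]): projection of the key-foreign key join R joined to S on RID = FK.
T = S.merge(R, left_on="FK", right_on="RID")
T["X"] = T["XS"] + T["XR"]          # concatenate the two feature vectors
T = T[["SID", "Y", "X"]]
```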
3. LEARNING OVER JOINS
We now discuss alternative approaches to run BGD over
a table that is logically the output of a key-foreign key join.
3.1 Assumptions and Cost Model
For the rest of the paper, we focus only on the data-intensive computation in step 4 of Algorithm 1: computing (∇F, F). The data-agnostic computations of updating w are identical across all approaches proposed here, and typically take only a few seconds.³ Tables 2 and 3 summarize our notation for the objects and parameters.
We focus on the classical hybrid hash join algorithm (considering other join algorithms is part of future work), which requires (m − 1) > √⌈f|R|⌉ [31]. We also focus primarily on the case n_S > n_R and |S| ≥ |R|. We discuss the cases n_S ≤ n_R or |S| < |R| in the appendix.
3.2 BGD After a Join: Materialize (M)
Materialize (M) is the current popular approach for handling ML over normalized datasets. Essentially, we write a new table and use it for BGD.
1. Apply hybrid hash join to obtain and write T.
2. Read T to compute (F, ∇F) for each iteration.
Following the style of the discussion of the hybrid hash join algorithm in [31], we now introduce some notation. The number of partitions of R is B = ⌈(⌈f|R|⌉ − (m − 2)) / ((m − 2) − 1)⌉.
³CGD and (L)BFGS differ from BGD only in these data-agnostic computations, which are easily implemented in, say, Python or R [12]. If a line search is used to tune α, we need to compute only F, but largely the same tradeoffs apply.
Symbol   Meaning                 Default Value (CPU Cycles)
hash     Hash a key              100
comp     Compare two keys        10
copy     Copy a double           1
add      Add two doubles         10
mult     Multiply two doubles    10
funcG    Compute G(a, b)         150
funcF    Compute F_e(a, b)       200

Table 3: Notation for the CPU cost model. The approximate default values for CPU cycles for each unit of the cost model were estimated empirically on the machine on which the experiments were run. Dividing by the CPU clock frequency yields the estimated runtimes. For G and F_e, we assume LR. LSR and LSVM are slightly faster.
Partition sizes are |R_0| = ⌊((m − 2) − B) / f⌋ and |R_i| = ⌈(|R| − |R_0|) / B⌉ (1 ≤ i ≤ B), with the ratio q = |R_0| / |R|, where R_0 is the first partition and R_i are the other partitions as per the hybrid hash join algorithm [31]. We provide the detailed I/O and CPU costs of Materialize here. The costs of the other approaches in this section can be derived from these, but due to space constraints, we present them in the appendix.
I/O Cost. If (m − 1) ≤ ⌈f|R|⌉, we partition the tables:

  (|R|+|S|)                      //First read
+ 2.(|R|+|S|).(1-q)              //Write, read temp partitions
+ |T|                            //Write output
+ |T|                            //Read for first iteration
+ (Iters-1).|T|                  //Remaining iterations
- min{|T|,[(m-2)-f|R_i|]}        //Cache T for iter 1
- min{|T|,(m-1)}.(Iters-1)       //MRU for rest

If (m − 1) > ⌈f|R|⌉, we need not partition the tables:

  (|R|+|S|)
+ (Iters+1).|T|
- min{|T|,[(m-2)-f|R|]}
- min{|T|,(m-1)}.(Iters-1)
CPU Cost
(nR+nS).hash //Partition R and S
+ nR.(1+dR).copy //Construct hash on R
+ nR.(1+dR).(1-q).copy //R output partitions
+ nS.(2+dS).(1-q).copy //S output partitions
+ (nR+nS).(1-q).hash //Hash on R and S partitions
+ nS.comp.f //Probe for all of S
+ nS.(1+dS+dR).copy //T output partitions
+ Iters.[ //Compute gradient and loss
nS.d.(mult+add) //w.xi for all i
+ nS.(funcG+funcF) //Apply G and F_e
+ nS.d.(mult+add) //Scale and add
+ nS.add //Add for total loss
]
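For intuition, the following Python sketch (ours) transcribes the Materialize I/O cost above into a simple calculator; it assumes the Table 2 defaults of a 1MB page size and a fudge factor f = 1.4, the example sizes are hypothetical, and all quantities are in pages.

```python
from math import ceil

def pages(n_rows, n_values_per_row, p=1024 * 1024):
    """Size in pages, assuming an 8B representation per value (Table 2)."""
    return ceil(8 * n_rows * n_values_per_row / p)

def materialize_io_cost(nS, nR, dS, dR, m, iters, f=1.4):
    """I/O cost of Materialize (in pages), per the formulas above."""
    R, S = pages(nR, 1 + dR), pages(nS, 2 + dS)
    T = pages(nS, 1 + dS + dR)
    if (m - 1) > ceil(f * R):
        # Hash table on all of R fits in buffer memory: no partitioning needed.
        cost = (R + S) + (iters + 1) * T
        cached = (m - 2) - f * R
    else:
        B = ceil((ceil(f * R) - (m - 2)) / ((m - 2) - 1))  # number of partitions of R
        R0 = int((m - 2 - B) // f)                         # first (in-memory) partition
        Ri = ceil((R - R0) / B)                            # each on-disk partition
        q = R0 / R
        cost = (R + S) + 2 * (R + S) * (1 - q)             # first read + temp partitions
        cost += T + T + (iters - 1) * T                    # write T, then read it every iteration
        cached = (m - 2) - f * Ri
    cost -= min(T, cached)                                 # cache part of T for iteration 1
    cost -= min(T, m - 1) * (iters - 1)                    # MRU caching for remaining iterations
    return cost

# Example (hypothetical sizes): 10M-row, 20-feature S; 100K-row, 40-feature R; 500-page buffer.
print(materialize_io_cost(nS=10_000_000, nR=100_000, dS=20, dR=40, m=500, iters=20))
```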
3.3 BGD Over a Join: Stream (S)
This approach performs the join lazily for each iteration.
1. Apply hybrid hash join to obtain T, but instead of writing T, compute (F, ∇F) on the fly.
2. Repeat step 1 for each iteration.

Figure 2: Redundancy ratio against the two dimension ratios (for d_S = 20). (A) Fix d_R/d_S and vary n_S/n_R. (B) Fix n_S/n_R and vary d_R/d_S.
The I/O cost of Stream is simply the cost of the hybrid hash
join multiplied by the number of iterations. Its CPU cost is
a combination of the join and BGD.
Discussion of Tradeoffs. The I/O and storage tradeoffs
between Materialize and Stream (Figure 1(B)) arise because
it is likely that many tuples of S join with a single tuple
of R (e.g., many customers might have the same employer).
Thus, |T| is usually larger than |S|+|R|. Obviously, the gap
depends upon the dataset sizes. More precisely, we define
the redundancy ratio (r) as the ratio of the size of T to that
of S and R:
$$ r = \frac{n_S\,(1 + d_S + d_R)}{n_S\,(2 + d_S) + n_R\,(1 + d_R)} = \frac{\frac{n_S}{n_R}\left(1 + \frac{d_R}{d_S} + \frac{1}{d_S}\right)}{\frac{n_S}{n_R}\left(1 + \frac{2}{d_S}\right) + \frac{d_R}{d_S} + \frac{1}{d_S}} $$
This ratio is useful because it gives us an idea of the factor
of speedups that are potentially possible by learning over
joins. Since it depends on the dimensions of the inputs, we
plot the redundancy ratio for different values of the tuple
ratio (
n
S
n
R
) and (inverse) feature ratio (
d
R
d
S
), while fixing d
S
.
Figure 2 presents the plots. Typically, both dimension ratios
are > 1, which mostly yields r > 1. But when the tuple ratio
is < 1, r < 1 (see Figure 2(A)). This is because the join here
becomes selective (when n_S < n_R). However, when the tuple ratio is > 1, we see that r increases with the tuple ratio. It converges to (1 + d_R/d_S + 1/d_S) / (1 + 2/d_S) ≈ 1 + d_R/d_S. Similarly, as shown in Figure 2(B), the redundancy ratio increases with the feature ratio, and converges to the tuple ratio n_S/n_R.
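The redundancy ratio is straightforward to compute directly; a small sketch (ours), with purely illustrative example numbers:

```python
def redundancy_ratio(nS, nR, dS, dR):
    """Size of T relative to S plus R (in 8B values), per the formula above."""
    return (nS * (1 + dS + dR)) / (nS * (2 + dS) + nR * (1 + dR))

# E.g., 10M customers, 100K employers, 20 customer features, 40 employer features:
r = redundancy_ratio(nS=10_000_000, nR=100_000, dS=20, dR=40)
# r is about 2.7 here, approaching the limit 1 + dR/dS = 3 as nS/nR grows.
```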
3.4 An Improvement: Stream-Reuse (SR)
We now present a simple modification to Stream, called the Stream-Reuse approach, that can significantly improve performance.
1. Apply hybrid hash join to obtain T, but instead of writing T, run the first iteration of BGD on the fly.
2. Maintain the temporary partitions of S and R on disk.
3. For the remaining iterations, reuse the partitions of S and R for the hybrid hash join, similar to step 1.
The I/O cost of Stream-Reuse gets rid of the rewriting (and rereading) of partitions at every iteration, but the CPU cost is reduced only slightly. Stream-Reuse makes the join "iteration-aware": we need to divide the implementation of the hybrid hash join into two steps so as to reuse the partitions across iterations. An easier way to implement it (without changing the RDBMS code) is to manually handle pre-partitioning at the logical query layer after consulting the optimizer about the number of partitions. Although the latter is a minor approximation to SR, the difference in performance (estimated using our analytical cost models) is mostly negligible.
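To convey the idea of partitioning once and reusing the partitions across iterations, here is a toy, in-memory NumPy simulation (ours; the table sizes, the partition count B, the stepsize, and the lr_grad_loss helper are all made up for illustration). A real implementation keeps the partitions on disk and reuses the RDBMS's hybrid hash join machinery.

```python
import numpy as np

# Toy base tables per Figure 1(A).
rng = np.random.default_rng(0)
nS, nR, dS, dR = 10_000, 100, 4, 6
S_y = rng.choice([-1.0, 1.0], size=nS)          # target Y
S_xs = rng.normal(size=(nS, dS))                # features X_S
S_fk = rng.integers(0, nR, size=nS)             # foreign key FK into R
R_xr = rng.normal(size=(nR, dR))                # features X_R

B = 4                                           # number of partitions
# Steps 1-2: partition S by hash(FK) once and keep the partitions around.
parts = [np.flatnonzero(S_fk % B == b) for b in range(B)]

def lr_grad_loss(w, y, x):
    """Gradient and loss contribution of one partition (logistic regression)."""
    ip = x @ w
    loss = np.log1p(np.exp(-y * ip)).sum()
    grad = x.T @ (-y / (1.0 + np.exp(y * ip)))
    return grad, loss

w, alpha = np.zeros(dS + dR), 1e-4
for it in range(10):                            # Step 3: reuse the partitions every iteration
    g, loss = np.zeros_like(w), 0.0
    for idx in parts:                           # stream one partition of S at a time
        x = np.hstack([S_xs[idx], R_xr[S_fk[idx]]])   # join on the fly; T is never written
        gp, lp = lr_grad_loss(w, S_y[idx], x)
        g, loss = g + gp, loss + lp
    w -= alpha * g                              # BGD update
```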
Figure 3: Logical workflow of factorized learning, consisting of three steps as numbered. HR(RID, PartialIP) and HS(RID, SumScaledIP) are logical intermediate relations defined over the base schemas R(RID, X_R) and S(SID, Y, X_S, FK). PartialIP refers to the partial inner products from R. SumScaledIP refers to the grouped sums of the scalar output of G() applied to the full inner products on the concatenated feature vectors. Here, γ_SUM denotes a SUM aggregation and γ_SUM(RID) denotes a SUM aggregation with a GROUP BY on RID.
4. FACTORIZED LEARNING
We now present a new technique that interleaves the I/O
and CPU processing of the join and BGD. The basic idea
is to avoid the redundancy introduced by the join by dividing the computations of both F and ∇F and "pushing them
through the join”. We call our technique factorized learn-
ing (Factorize, or FL for short), borrowing the terminology
from “factorized” databases [8]. An overview of the logical
computations in FL is presented in Figure 3.
The key insight in FL is as follows: given a feature vector x ∈ T, we have w^T x = w_S^T x_S + w_R^T x_R. Since the join duplicates x_R from R when constructing T, the main goal of FL is to avoid redundant inner product computations as well as I/O over those feature vectors from R. FL achieves this goal with the following three steps (numbered in Figure 3); a minimal code sketch follows the list.
1. Compute and save the partial inner products w_R^T x_R for each tuple in R in a new table HR under the PartialIP column (part 1 in Figure 3).
2. Recall that the computations of F and ∇F are clubbed together, and that ∇F ≡ [∇F_S ∇F_R]. This step computes F and ∇F_S together. Essentially, we join HR and S on RID and complete the computation of the full inner products on the fly, and follow that up by applying both F_e() and G() on each example. By aggregating both these quantities as it performs the join, FL completes the computation of F = Σ F_e(y, w^T x) and ∇F_S = Σ G(y, w^T x) x_S. Simultaneously, FL also performs a GROUP BY on RID and sums up G(y, w^T x), which is saved in a new table HS under the SumScaledIP column (part 2 in Figure 3).
3. Compute ∇F_R = Σ G(y, w^T x) x_R by joining HS with R on RID and scaling the partial feature vectors x_R with SumScaledIP (part 3 in Figure 3).
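A minimal NumPy rendering of these three steps for logistic regression (our sketch; it assumes RIDs are 0..n_R−1 so that the GROUP BY on RID can be a simple bincount, and the function and variable names are ours):

```python
import numpy as np

def fl_loss_and_gradient(w_S, w_R, S_y, S_xs, S_fk, R_xr):
    """One BGD pass of factorized learning for LR, without ever forming T."""
    # Step 1: partial inner products over R -> HR(RID, PartialIP).
    partial_ip = R_xr @ w_R                                   # shape (nR,)

    # Step 2: join HR with S on RID, finish the inner products on the fly, and
    # aggregate F, grad F_S, and the per-RID sums HS(RID, SumScaledIP).
    b = S_xs @ w_S + partial_ip[S_fk]                         # full w^T x per example
    F = np.log1p(np.exp(-S_y * b)).sum()                      # F_e for LR
    g = -S_y / (1.0 + np.exp(S_y * b))                        # G for LR, one scalar per example
    grad_S = S_xs.T @ g                                       # sum_i G_i * x_S,i
    sum_scaled_ip = np.bincount(S_fk, weights=g, minlength=R_xr.shape[0])

    # Step 3: join HS with R on RID and scale x_R by SumScaledIP.
    grad_R = R_xr.T @ sum_scaled_ip                           # sum_rid HS[rid] * x_R[rid]
    return F, np.concatenate([grad_S, grad_R])

# Tiny usage with random data (shapes follow Figure 1(A)):
rng = np.random.default_rng(0)
nS, nR, dS, dR = 1000, 50, 3, 4
S_y, S_xs = rng.choice([-1.0, 1.0], nS), rng.normal(size=(nS, dS))
S_fk, R_xr = rng.integers(0, nR, nS), rng.normal(size=(nR, dR))
F, gradF = fl_loss_and_gradient(np.zeros(dS), np.zeros(dR), S_y, S_xs, S_fk, R_xr)
```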
Example: Consider logistic regression (LR). In step 2, as the full inner product w^T x is computed by joining HR and
