Learning Generalized Linear Models Over Normalized Data
Arun Kumar Jeffrey Naughton Jignesh M. Patel
Department of Computer Sciences,
University of Wisconsin-Madison
{arun, naughton, jignesh}@cs.wisc.edu
ABSTRACT
Enterprise data analytics is a booming area in the data man-
agement industry. Many companies are racing to develop
toolkits that closely integrate statistical and machine learn-
ing techniques with data management systems. Almost all
such toolkits assume that the input to a learning algorithm
is a single table. However, most relational datasets are not
stored as single tables due to normalization. Thus, analysts
often perform key-foreign key joins before learning on the
join output. This strategy of learning after joins introduces
redundancy avoided by normalization, which could lead to
poorer end-to-end performance and maintenance overheads
due to data duplication. In this work, we take a step towards
enabling and optimizing learning over joins for a common
class of machine learning techniques called generalized linear
models that are solved using gradient descent algorithms in
an RDBMS setting. We present alternative approaches to
learn over a join that are easy to implement over existing
RDBMSs. We introduce a new approach named factorized
learning that pushes ML computations through joins and
avoids redundancy in both I/O and computations. We study
the tradeoff space for all our approaches both analytically
and empirically. Our results show that factorized learning
is often substantially faster than the alternatives, but is not
always the fastest, necessitating a cost-based approach. We
also discuss extensions of all our approaches to multi-table
joins as well as to Hive.
Categories and Subject Descriptors
H.2 [Information Systems]: Database Management
Keywords
Analytics; feature engineering; joins; machine learning
1. INTRODUCTION
There is an escalating arms race to bring sophisticated sta-
tistical and machine learning (ML) techniques to enterprise
applications [3, 5]. A number of projects in both industry
and academia aim to integrate ML capabilities with data
processing in RDBMSs, Hadoop, and other systems [2, 4, 9,
15,18,21,22,33,34]. Almost all such implementations of ML
algorithms require that the input dataset be a single table.
However, most relational datasets are not stored as single
tables due to normalization [27]. Thus, analysts often per-
form key-foreign key joins of the base tables and materialize
a single temporary table that is used as the input to the ML
algorithm, i.e., they learn after joins.
Example: Consider an insurance company analyst modeling customer churn (will a customer leave the company or not), a standard classification task. She builds a logistic regression model using the large table that stores customer details: Customers(CustomerID, Churn, Age, Income, . . . , EmployerID). Note that one of the features, EmployerID, is the ID of the customer's employer. It is a foreign key that refers to a separate table that stores details about companies and other organizations: Employers(EmployerID, Revenue, NumEmployees, . . . ). She joins the two tables on the
EmployerID as part of her “feature engineering” because she
thinks the features of the employer might be helpful in pre-
dicting how likely a customer is to churn. For example, she
might have a hunch that customers employed by large cor-
porations are less likely to churn. She writes the output of
the join as a single temporary table and feeds it to an ML
toolkit that implements logistic regression.
Similar examples arise in a number of other application
domains, e.g., detecting malicious users by joining data about
user accounts with account activities, predicting census mail
response rates by joining data about census districts with
individual households, recommending products by joining
data about past ratings with users and products, etc.
Learning after joins imposes an artificial barrier between
the ML-based analysis and the base relations, resulting in
several practical issues. First, the table obtained after the
join can be much larger than the base tables themselves
because the join introduces redundancy that was originally
removed by database normalization [8, 27]. This results in
unnecessary overheads for storage and performance, as well as wasted time performing extra computations on redundant data. Second, as the base tables evolve, maintaining
the materialized output of the join could become an over-
head. Finally, analysts often perform exploratory analysis
of different subsets of features and data [20, 33]. Materializ-
ing temporary tables after joins for learning on each subset
could slow the analyst and inhibit exploration [7]. Learning
over joins, i.e., pushing ML computations through joins to
the base tables, mitigates such drawbacks.

Figure 1: Learning over a join: (A) Schema and logical workflow. Feature vectors from S (e.g., Customers)
and R (e.g., Employers) are concatenated and used for BGD. The loss (F) and gradient (∇F) for BGD
can be computed together during a pass over the data. Approaches compared: Materialize (M), Stream
(S), Stream-Reuse (SR), and Factorized Learning (FL). High-level qualitative comparison of storage-runtime
tradeoffs and CPU-I/O cost tradeoffs for runtimes of the four approaches (S is assumed to be larger than R,
and the plots are not to scale) (B) When the hash table on R does not fit in buffer memory, S, SR, and M
require extra storage space for temporary tables or partitions. But, SR could be faster than FL due to lower
I/O costs. (C) When the hash table on R fits in buffer memory, but S does not, SR becomes similar to S
and neither need extra storage space, but both could be slower than FL. (D) When all data fit comfortably
in buffer memory, none of the approaches need extra storage space, and M could be faster than FL.
From a technical perspective, the issues that arise from
the redundancy present in a denormalized relation (used
for learning after joins) are well known in the context of
traditional relational data management [27]. But the im-
plications of this type of redundancy in the context of ML
algorithms are much less well understood. Thus, an im-
portant challenge to be addressed is whether it is possible to de-
vise approaches that learn over joins and avoid introducing
such redundancy without sacrificing either the model qual-
ity, learning efficiency, or scalability compared to the cur-
rently standard approach of learning after joins.
As a first step, in this paper, we show that, for a large
generic class of ML techniques called Generalized Linear
Models (GLMs), it is possible to learn over joins and avoid
redundancy without sacrificing quality and scalability, while
actually improving performance. Furthermore, all our ap-
proaches to learn GLMs over joins are simple and easy to im-
plement using existing RDBMS abstractions, which makes
them more easily deployable than approaches that require
deep changes to the code of an RDBMS. We focus on GLMs
because they include many popular classification and regres-
sion techniques [17, 24]. We use standard gradient methods
to learn GLMs: Batch Gradient Descent (BGD), Conjugate
Gradient (CGD), and (L)BFGS [26]. For clarity of exposi-
tion, we use only BGD, but our results are also applicable
to these other gradient methods. BGD is a numerical opti-
mization algorithm that minimizes an objective function by
performing multiple passes (iterations) over the data.
Figure 1(A) gives a high-level overview of our problem.
We call the approach of materializing T before BGD as Ma-
terialize. We focus on the hybrid hash algorithm for the join
operation [31]. We assume that R is smaller in size than S
and estimate the I/O and CPU costs of all our approaches in
a manner similar to [31]. We propose three alternative ap-
proaches to run BGD over a join in a single-node RDBMS
setting: Stream, Stream-Reuse, and Factorized Learning.
Each approach avoids some forms of redundancy. Stream
avoids writing T and could save on I/O. Stream-Reuse also
exploits the fact that BGD is iterative and avoids reparti-
tioning of the base relations after the first iteration. But,
neither approach avoids redundancy in the computations for
BGD. Thus, we design the Factorized Learning (in short,
Factorize) approach that avoids computational redundancy
as well. Factorize achieves this by interleaving the compu-
tations and I/O of the join operation and BGD. None of
our approaches compromise on model quality. Furthermore,
they are all easy to implement in an RDBMS using the
abstraction of user-defined aggregate functions (UDAFs),
which provides scalability and ease of deployment [13, 16].
The performance picture, however, is more complex. Fig-
ures 1(B-D) give a high-level qualitative overview of the
tradeoff space for all our approaches in terms of the stor-
age space needed and the runtimes (split into I/O and CPU
costs). Both our analytical and experimental results show
that Factorize is often the fastest approach, but which ap-
proach is the fastest depends on a combination of factors
such as buffer memory, input table dimensions, and number
of iterations. Thus, a cost model such as ours is required to
select the fastest approach for a given instance of our prob-
lem. Furthermore, we identify that Factorize might face a
scalability bottleneck since it maintains an aggregation state
whose size is linear in the number of tuples in R. We pro-
pose three extensions to mitigate this bottleneck and find
that none of them dominate the others in terms of runtime,
which again necessitates our cost model.
We extend all our approaches to multi-table joins, specif-
ically, the case in which S has multiple foreign keys. Such
a scenario arises in applications such as recommendation
systems in which a table of ratings refers to both the user
and product tables [28]. We show that optimally extend-
ing Factorize to multi-table joins involves solving a problem
that is NP-Hard. We propose a simple, but effective, greedy
heuristic to tackle this problem. Finally, we extend all our
approaches to the shared-nothing parallel setting and im-
plement them on Hive. We find near-linear speedups and
scaleups for all our approaches.
In summary, our work makes the following contributions:
• To the best of our knowledge, this is the first paper to
study the problem of learning over joins of large rela-
tions without materializing the join output. Focusing
on GLMs solved using BGD, we explain the tradeoff
space in terms of I/O and CPU costs and propose al-
ternative approaches to learn over joins.
• We propose the Factorize approach that pushes BGD
computations through a join, while being amenable to
a simple implementation in existing RDBMSs.

ML Technique                                        F_e(a, b) (For Loss)     G(a, b) (For Gradient)
Logistic Regression (LR)                            log(1 + e^(−ab))         −a / (1 + e^(ab))
Least-Squares Regression (LSR), Lasso, and Ridge    (a − b)^2                2(b − a)
Linear Support Vector Machine (LSVM)                max{0, 1 − ab}           −a·δ_(ab<1)

Table 1: GLMs and their functions.
• We compare the performance of all our approaches
both analytically and empirically using implementa-
tions on PostgreSQL. Our results show that Factorize
is often, but not always, the fastest approach. A com-
bination of factors such as the buffer memory, the di-
mensions of the input tables, and the number of BGD
iterations determines which approach is the fastest.
We also validate the accuracy of our analytical models.
• We extend all our approaches to multi-table joins. We
also demonstrate how to parallelize them using imple-
mentations on Hive.
Outline. In Section 2, we present a brief background on
GLMs and BGD and some preliminaries for our problem. In
Section 3, we explain our cost model and simple approaches
to learn over joins. In Section 4, we present the new ap-
proach of Factorized Learning and its extensions. In Section
5, we discuss our experimental setup and results. We discuss
related work in Section 6 and conclude in Section 7.
2. BACKGROUND AND PRELIMINARIES
We provide a brief introduction to GLMs and BGD. For
a deeper description, we refer the reader to [17, 24, 26].
Generalized Linear Models (GLMs). Consider a dataset of n examples, each of which includes a d-dimensional numeric feature vector, x_i, and a numeric target, y_i (i = 1 to n). For regression, y_i ∈ ℝ, while for (binary) classification, y_i ∈ {−1, 1}. Loosely, GLMs assume that the data points can be separated into their target classes (for classification), or approximated (for regression), by a hyperplane. The idea is to compute such a hyperplane w ∈ ℝ^d by defining an optimization problem using the given dataset.
We are given a linearly separable objective function that computes the loss of a given model w ∈ ℝ^d on the data: F(w) = Σ_{i=1}^{n} F_e(y_i, w^T x_i). The goal of an ML algorithm is to minimize the loss function, i.e., find a vector w* ∈ ℝ^d such that w* = arg min_w F(w). Table 1 lists examples of some popular GLM techniques and their respective loss functions. The loss functions of GLMs are convex (bowl-shaped), which means any local minimum is a global minimum, and standard gradient descent algorithms can be used to solve them.¹
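To make the notation concrete, here is a small NumPy sketch (ours, not code from the paper) of the per-example functions F_e(a, b) and G(a, b) from Table 1, with a = y_i and b = w^T x_i, together with the summed loss F(w) and gradient ∇F(w); the GLMS dictionary and the loss_and_gradient helper are hypothetical names.

```python
import numpy as np

# Per-example loss F_e(a, b) and gradient scalar G(a, b) from Table 1,
# where a = y_i and b = w^T x_i.
GLMS = {
    "LR":   (lambda a, b: np.log1p(np.exp(-a * b)),      # logistic regression
             lambda a, b: -a / (1.0 + np.exp(a * b))),
    "LSR":  (lambda a, b: (a - b) ** 2,                   # least-squares regression
             lambda a, b: 2.0 * (b - a)),
    "LSVM": (lambda a, b: np.maximum(0.0, 1.0 - a * b),   # linear SVM (hinge loss)
             lambda a, b: -a * (a * b < 1.0)),
}

def loss_and_gradient(w, X, y, model="LR"):
    """F(w) = sum_i F_e(y_i, w^T x_i) and grad F(w) = sum_i G(y_i, w^T x_i) x_i."""
    f_e, g = GLMS[model]
    b = X @ w                    # inner products w^T x_i for all i
    F = f_e(y, b).sum()
    gradF = X.T @ g(y, b)        # sum_i G(y_i, b_i) * x_i
    return F, gradF
```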
Batch Gradient Descent (BGD). BGD is a simple algorithm to solve GLMs using iterative numerical optimization. BGD initializes the model w to some w_0, computes the gradient ∇F(w) on the given dataset, and updates the model as w ← w − α∇F(w), where α > 0 is the stepsize parameter. The method is outlined in Algorithm 1. Like F, the gradient is also linearly separable: ∇F(w) = Σ_{i=1}^{n} G(y_i, w^T x_i) x_i.
¹Typically, a convex penalty term called a regularizer is added to the loss to constrain ‖w‖ [17].
Algorithm 1 Batch Gradient Descent (BGD)
Inputs: {x_i, y_i}_{i=1}^{n} (Data), w_0 (Initial model)
1: k ← 0, r_prev ← null, r_curr ← null, g_k ← null
2: while (Stop(k, r_prev, r_curr, g_k) = False) do
3:    r_prev ← r_curr
4:    (g_k, r_curr) ← (∇F_{k+1}, F_{k+1})    ▷ 1 pass over data
5:    w_{k+1} ← w_k − α_k g_k    ▷ Pick α_k by line search
6:    k ← k + 1
7: end while
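A minimal Python rendering of Algorithm 1 (our sketch; it uses a fixed stepsize instead of the line search, a simple loss-based stopping rule, and function names of our own choosing):

```python
import numpy as np

def bgd(grad_and_loss, w0, alpha=0.1, max_iters=100, tol=1e-6):
    """Algorithm 1: grad_and_loss(w) returns (grad F(w), F(w)) in one pass over the data."""
    w, r_prev = w0.copy(), None
    for k in range(max_iters):
        g_k, r_curr = grad_and_loss(w)            # step 4: one pass over the data
        if r_prev is not None and abs(r_prev - r_curr) < tol:
            break                                 # Stop(): loss has converged
        w, r_prev = w - alpha * g_k, r_curr       # step 5, with a fixed stepsize alpha
    return w

# Example usage on a tiny least-squares problem (LSR from Table 1):
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
lsr = lambda w: ((2.0 * (X @ w - y)) @ X, ((y - X @ w) ** 2).sum())
w_star = bgd(lsr, np.zeros(2))
```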
Since the gradient is the direction of steepest ascent of F, BGD is also known as the method of steepest descent [26]. Table 1 also lists the gradient functions of the GLMs. We shall use ∇F and ∇F(w) interchangeably.
BGD updates the model repeatedly, i.e., over many itera-
tions (or epochs), each of which requires (at least) one pass
over the data. The loss value typically drops over itera-
tions. The algorithm is typically stopped after a pre-defined
number of iterations, or when it converges (e.g., the drop in
the loss value across iterations, or the norm of the gradient,
falls below a given threshold). The stepsize parameter (α) is
typically tuned using a line search method that potentially
computes the loss many times (similar to step 4) [26].
On large data, it is likely that computing F and ∇F dominates the runtime of BGD [12, 13]. Fortunately, both F and ∇F can be computed scalably in a manner similar
to distributive aggregates like SUM in SQL. Thus, it is easy
to implement BGD using the abstraction of a user-defined
aggregate function (UDAF) that is available in almost all
RDBMSs [13, 16]. However, unlike SUM, BGD performs a
“multi-column” or vector aggregation since all feature values
of an example are needed to compute its contribution to the
gradient. For simplicity of exposition, we assume that fea-
ture vectors are instead stored as arrays in a single column.
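As an illustration of that abstraction (our sketch, not the paper's actual UDAF code), the Python class below loosely mirrors the initialize/transition/merge/finalize contract of UDAFs in RDBMSs such as PostgreSQL, accumulating (F, ∇F) for logistic regression while treating each tuple's feature vector as a single array-valued column:

```python
import numpy as np

class BGDLossGradAgg:
    """UDAF-style aggregate that accumulates (F, grad F) for LR in one pass."""
    def __init__(self, w):
        self.w = np.asarray(w)           # current model, fixed during this pass
        self.F = 0.0
        self.gradF = np.zeros_like(self.w)

    def transition(self, y, x):
        """Consume one tuple (y, x); x is the feature array stored in one column."""
        b = float(np.dot(self.w, x))     # inner product w^T x
        self.F += np.log1p(np.exp(-y * b))
        self.gradF += (-y / (1.0 + np.exp(y * b))) * np.asarray(x)

    def merge(self, other):
        """Combine partial states (e.g., from parallel segments), like SUM."""
        self.F += other.F
        self.gradF += other.gradF
        return self

    def finalize(self):
        return self.F, self.gradF

# One pass over a toy dataset:
agg = BGDLossGradAgg(w=np.zeros(3))
for y, x in [(1.0, [1.0, 0.5, -1.0]), (-1.0, [0.2, -0.3, 0.8])]:
    agg.transition(y, x)
F, gradF = agg.finalize()
```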
Joins Before Learning. From our conversations with analysts at companies across various domains (insurance, consulting, Web search, security, and e-commerce), we have learned that analysts often perform joins to replace foreign key references with actual feature values as part of their feature engineering effort.² In this work, we focus chiefly on a two-table join. We term the main table with the entities to learn on as the entity table (denoted S). We term the other table as the attribute table (denoted R). A column in S is a foreign key that refers to R.
Problem Statement. Suppose there are n_S examples (tuples) in S, and n_R tuples in R. Assume that the feature vectors are split across S and R, with d_S − 1 features in X_S and d_R = d − d_S + 1 in X_R. Thus, the "width" of S is 2 + d_S, including the ID, foreign key, and target. The width of R is 1 + d_R, including the ID. Typically, we have n_S ≫ n_R, similar to how fact tables have more tuples than dimension tables in OLAP [16, 27]. We now state our problem formally (illustrated in Figure 1(A)).

Given two relations S(SID, Y, X_S, FK) and R(RID, X_R) with a key-foreign key relationship (S.FK refers to R.RID), where X_S and X_R are feature vectors and Y is the target, learn a GLM using BGD over the result of the projected equi-join T(SID, Y, [X_S X_R]) = π(R ⋈_{RID=FK} S) such that the feature vector of a tuple in T is the concatenation of the feature vectors from the joining tuples of S and R.

²An alternative is to simply ignore the foreign key, or treat it as a large, sparse categorical feature. Such feature engineering judgements are largely analyst-specific [7, 20, 33]. Our work simply aims to make feature engineering easier.

Symbol   Meaning
R        Attribute table
S        Entity table
T        Join result table
n_R      Number of rows in R
n_S      Number of rows in S
d_R      Number of features in R
d_S      Number of features in S (includes Y)
p        Page size in bytes (1MB used)
m        Allocated buffer memory (pages)
f        Hash table fudge factor (1.4 used)
|R|      Number of R pages = ⌈8·n_R·(1 + d_R) / p⌉
|S|      Number of S pages = ⌈8·n_S·(2 + d_S) / p⌉
|T|      Number of T pages = ⌈8·n_S·(1 + d_S + d_R) / p⌉
Iters    Number of iterations of BGD (≥ 1)

Table 2: Notation for objects and parameters used in the cost models. I/O costs are counted in number of pages. Dividing by the disk throughput yields the estimated runtimes. NB: As a simplifying assumption, we use an 8B representation for all values: IDs, target, and features (categorical features are assumed to have been converted to numeric ones [17]).
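To make the schema and the projected equi-join concrete, here is a small pandas sketch (ours; the toy rows and the convention of storing a feature vector as a list in one column are purely illustrative) that materializes T the way the learning-after-joins strategy would:

```python
import pandas as pd

# Entity table S(SID, Y, XS, FK) and attribute table R(RID, XR), per Figure 1(A);
# feature vectors are kept as lists in a single column for simplicity.
S = pd.DataFrame({
    "SID": [1, 2, 3],
    "Y":   [1, -1, 1],
    "XS":  [[0.1, 2.0], [1.5, 0.3], [0.7, 0.7]],
    "FK":  [10, 20, 10],
})
R = pd.DataFrame({
    "RID": [10, 20],
    "XR":  [[5.0, 1.0, 0.0], [2.0, 0.0, 1.0]],
})

# T(SID, Y, [XS XR]): projection of the key-foreign key join R joined to S on RID = FK.
T = S.merge(R, left_on="FK", right_on="RID")
T["X"] = T["XS"] + T["XR"]          # concatenate the two feature vectors
T = T[["SID", "Y", "X"]]
```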
3. LEARNING OVER JOINS
We now discuss alternative approaches to run BGD over
a table that is logically the output of a key-foreign key join.
3.1 Assumptions and Cost Model
For the rest of the paper, we focus only on the data-intensive computation in step 4 of Algorithm 1: computing (∇F, F). The data-agnostic computations of updating w are identical across all approaches proposed here, and typically take only a few seconds.³ Tables 2 and 3 summarize our notation for the objects and parameters.
We focus on the classical hybrid hash join algorithm (considering other join algorithms is part of future work), which requires (m − 1) > √⌈f|R|⌉ [31]. We also focus primarily on the case n_S > n_R and |S| ≥ |R|. We discuss the cases n_S ≤ n_R or |S| < |R| in the appendix.
3.2 BGD After a Join: Materialize (M)
Materialize (M) is the current popular approach for handling ML over normalized datasets. Essentially, we write a new table and use it for BGD.
1. Apply hybrid hash join to obtain and write T.
2. Read T to compute (F, ∇F) for each iteration.
Following the style of the discussion of the hybrid hash join algorithm in [31], we now introduce some notation. The number of partitions of R is B = ⌈(⌈f|R|⌉ − (m − 2)) / ((m − 2) − 1)⌉.
³CGD and (L)BFGS differ from BGD only in these data-agnostic computations, which are easily implemented in, say, Python or R [12]. If a line search is used to tune α, we need to compute only F, but largely the same tradeoffs apply.
Symbol   Meaning                 Default Value (CPU Cycles)
hash     Hash a key              100
comp     Compare two keys        10
copy     Copy a double           1
add      Add two doubles         10
mult     Multiply two doubles    10
funcG    Compute G(a, b)         150
funcF    Compute F_e(a, b)       200

Table 3: Notation for the CPU cost model. The approximate default values for CPU cycles for each unit of the cost model were estimated empirically on the machine on which the experiments were run. Dividing by the CPU clock frequency yields the estimated runtimes. For G and F_e, we assume LR. LSR and LSVM are slightly faster.
Partition sizes are |R_0| = ⌊((m − 2) − B) / f⌋ and |R_i| = ⌈(|R| − |R_0|) / B⌉ (1 ≤ i ≤ B), with the ratio q = |R_0| / |R|, where R_0 is the first partition and R_i are the other partitions as per the hybrid hash join algorithm [31]. We provide the detailed I/O and CPU costs of Materialize here. The costs of the other approaches in this section can be derived from these, but due to space constraints, we present them in the appendix.
I/O Cost. If (m − 1) ≤ ⌈f|R|⌉, we partition the tables:

  (|R|+|S|)                      //First read
+ 2.(|R|+|S|).(1-q)              //Write, read temp partitions
+ |T|                            //Write output
+ |T|                            //Read for first iteration
+ (Iters-1).|T|                  //Remaining iterations
- min{|T|,[(m-2)-f|R_i|]}        //Cache T for iter 1
- min{|T|,(m-1)}.(Iters-1)       //MRU for rest

If (m − 1) > ⌈f|R|⌉, we need not partition the tables:

  (|R|+|S|)
+ (Iters+1).|T|
- min{|T|,[(m-2)-f|R|]}
- min{|T|,(m-1)}.(Iters-1)
CPU Cost
(nR+nS).hash //Partition R and S
+ nR.(1+dR).copy //Construct hash on R
+ nR.(1+dR).(1-q).copy //R output partitions
+ nS.(2+dS).(1-q).copy //S output partitions
+ (nR+nS).(1-q).hash //Hash on R and S partitions
+ nS.comp.f //Probe for all of S
+ nS.(1+dS+dR).copy //T output partitions
+ Iters.[ //Compute gradient and loss
nS.d.(mult+add) //w.xi for all i
+ nS.(funcG+funcF) //Apply G and F_e
+ nS.d.(mult+add) //Scale and add
+ nS.add //Add for total loss
]
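For intuition, the following Python sketch (ours) transcribes the Materialize I/O cost above into a simple calculator; it assumes the Table 2 defaults of a 1MB page size and a fudge factor f = 1.4, the example sizes are hypothetical, and all quantities are in pages.

```python
from math import ceil

def pages(n_rows, n_values_per_row, p=1024 * 1024):
    """Size in pages, assuming an 8B representation per value (Table 2)."""
    return ceil(8 * n_rows * n_values_per_row / p)

def materialize_io_cost(nS, nR, dS, dR, m, iters, f=1.4):
    """I/O cost of Materialize (in pages), per the formulas above."""
    R, S = pages(nR, 1 + dR), pages(nS, 2 + dS)
    T = pages(nS, 1 + dS + dR)
    if (m - 1) > ceil(f * R):
        # Hash table on all of R fits in buffer memory: no partitioning needed.
        cost = (R + S) + (iters + 1) * T
        cached = (m - 2) - f * R
    else:
        B = ceil((ceil(f * R) - (m - 2)) / ((m - 2) - 1))  # number of partitions of R
        R0 = int((m - 2 - B) // f)                         # first (in-memory) partition
        Ri = ceil((R - R0) / B)                            # each on-disk partition
        q = R0 / R
        cost = (R + S) + 2 * (R + S) * (1 - q)             # first read + temp partitions
        cost += T + T + (iters - 1) * T                    # write T, then read it every iteration
        cached = (m - 2) - f * Ri
    cost -= min(T, cached)                                 # cache part of T for iteration 1
    cost -= min(T, m - 1) * (iters - 1)                    # MRU caching for remaining iterations
    return cost

# Example (hypothetical sizes): 10M-row, 20-feature S; 100K-row, 40-feature R; 500-page buffer.
print(materialize_io_cost(nS=10_000_000, nR=100_000, dS=20, dR=40, m=500, iters=20))
```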
3.3 BGD Over a Join: Stream (S)
This approach performs the join lazily for each iteration.
1. Apply hybrid hash join to obtain T, but instead of writing T, compute (F, ∇F) on the fly.
2. Repeat step 1 for each iteration.

Figure 2: Redundancy ratio against the two dimension ratios (for d_S = 20). (A) Fix d_R/d_S and vary n_S/n_R. (B) Fix n_S/n_R and vary d_R/d_S.
The I/O cost of Stream is simply the cost of the hybrid hash
join multiplied by the number of iterations. Its CPU cost is
a combination of the join and BGD.
Discussion of Tradeoffs. The I/O and storage tradeoffs
between Materialize and Stream (Figure 1(B)) arise because
it is likely that many tuples of S join with a single tuple
of R (e.g., many customers might have the same employer).
Thus, |T| is usually larger than |S|+|R|. Obviously, the gap
depends upon the dataset sizes. More precisely, we define
the redundancy ratio (r) as the ratio of the size of T to that
of S and R:
$$ r = \frac{n_S\,(1 + d_S + d_R)}{n_S\,(2 + d_S) + n_R\,(1 + d_R)} = \frac{\frac{n_S}{n_R}\left(1 + \frac{d_R}{d_S} + \frac{1}{d_S}\right)}{\frac{n_S}{n_R}\left(1 + \frac{2}{d_S}\right) + \frac{d_R}{d_S} + \frac{1}{d_S}} $$
This ratio is useful because it gives us an idea of the factor
of speedups that are potentially possible by learning over
joins. Since it depends on the dimensions of the inputs, we
plot the redundancy ratio for different values of the tuple
ratio (
n
S
n
R
) and (inverse) feature ratio (
d
R
d
S
), while fixing d
S
.
Figure 2 presents the plots. Typically, both dimension ratios
are > 1, which mostly yields r > 1. But when the tuple ratio
is < 1, r < 1 (see Figure 2(A)). This is because the join here
becomes selective (when n_S < n_R). However, when the tuple ratio is > 1, we see that r increases with the tuple ratio. It converges to (1 + d_R/d_S + 1/d_S) / (1 + 2/d_S) ≈ 1 + d_R/d_S. Similarly, as shown in Figure 2(B), the redundancy ratio increases with the feature ratio, and converges to the tuple ratio n_S/n_R.
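The redundancy ratio is straightforward to compute directly; a small sketch (ours), with purely illustrative example numbers:

```python
def redundancy_ratio(nS, nR, dS, dR):
    """Size of T relative to S plus R (in 8B values), per the formula above."""
    return (nS * (1 + dS + dR)) / (nS * (2 + dS) + nR * (1 + dR))

# E.g., 10M customers, 100K employers, 20 customer features, 40 employer features:
r = redundancy_ratio(nS=10_000_000, nR=100_000, dS=20, dR=40)
# r is about 2.7 here, approaching the limit 1 + dR/dS = 3 as nS/nR grows.
```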
3.4 An Improvement: Stream-Reuse (SR)
We now present a simple modification to Stream, called the Stream-Reuse approach, that can significantly improve performance.
1. Apply hybrid hash join to obtain T, but instead of writing T, run the first iteration of BGD on the fly.
2. Maintain the temporary partitions of S and R on disk.
3. For the remaining iterations, reuse the partitions of S and R for the hybrid hash join, similar to step 1.
The I/O cost of Stream-Reuse gets rid of the rewriting (and rereading) of partitions at every iteration, but the CPU cost is reduced only slightly. Stream-Reuse makes the join "iteration-aware": we need to divide the implementation of the hybrid hash join into two steps so as to reuse the partitions across iterations. An easier way to implement it (without changing the RDBMS code) is to manually handle pre-partitioning at the logical query layer after consulting the optimizer about the number of partitions. Although the latter is a minor approximation to SR, the difference in performance (estimated using our analytical cost models) is mostly negligible.
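To convey the idea of partitioning once and reusing the partitions across iterations, here is a toy, in-memory NumPy simulation (ours; the table sizes, the partition count B, the stepsize, and the lr_grad_loss helper are all made up for illustration). A real implementation keeps the partitions on disk and reuses the RDBMS's hybrid hash join machinery.

```python
import numpy as np

# Toy base tables per Figure 1(A).
rng = np.random.default_rng(0)
nS, nR, dS, dR = 10_000, 100, 4, 6
S_y = rng.choice([-1.0, 1.0], size=nS)          # target Y
S_xs = rng.normal(size=(nS, dS))                # features X_S
S_fk = rng.integers(0, nR, size=nS)             # foreign key FK into R
R_xr = rng.normal(size=(nR, dR))                # features X_R

B = 4                                           # number of partitions
# Steps 1-2: partition S by hash(FK) once and keep the partitions around.
parts = [np.flatnonzero(S_fk % B == b) for b in range(B)]

def lr_grad_loss(w, y, x):
    """Gradient and loss contribution of one partition (logistic regression)."""
    ip = x @ w
    loss = np.log1p(np.exp(-y * ip)).sum()
    grad = x.T @ (-y / (1.0 + np.exp(y * ip)))
    return grad, loss

w, alpha = np.zeros(dS + dR), 1e-4
for it in range(10):                            # Step 3: reuse the partitions every iteration
    g, loss = np.zeros_like(w), 0.0
    for idx in parts:                           # stream one partition of S at a time
        x = np.hstack([S_xs[idx], R_xr[S_fk[idx]]])   # join on the fly; T is never written
        gp, lp = lr_grad_loss(w, S_y[idx], x)
        g, loss = g + gp, loss + lp
    w -= alpha * g                              # BGD update
```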
Figure 3: Logical workflow of factorized learning, consisting of three steps as numbered. HR(RID, PartialIP) and HS(RID, SumScaledIP) are logical intermediate relations defined over the base schemas R(RID, X_R) and S(SID, Y, X_S, FK). PartialIP refers to the partial inner products from R. SumScaledIP refers to the grouped sums of the scalar output of G() applied to the full inner products on the concatenated feature vectors. Here, γ_SUM denotes a SUM aggregation and γ_SUM(RID) denotes a SUM aggregation with a GROUP BY on RID.
4. FACTORIZED LEARNING
We now present a new technique that interleaves the I/O
and CPU processing of the join and BGD. The basic idea
is to avoid the redundancy introduced by the join by dividing the computations of both F and ∇F and "pushing them
through the join”. We call our technique factorized learn-
ing (Factorize, or FL for short), borrowing the terminology
from “factorized” databases [8]. An overview of the logical
computations in FL is presented in Figure 3.
The key insight in FL is as follows: given a feature vector x ∈ T, we have w^T x = w_S^T x_S + w_R^T x_R. Since the join duplicates x_R from R when constructing T, the main goal of FL is to avoid redundant inner product computations as well as I/O over those feature vectors from R. FL achieves this goal with the following three steps (numbered in Figure 3); a minimal code sketch follows the list.
1. Compute and save the partial inner products w_R^T x_R for each tuple in R in a new table HR under the PartialIP column (part 1 in Figure 3).
2. Recall that the computations of F and ∇F are clubbed together, and that ∇F ≡ [∇F_S ∇F_R]. This step computes F and ∇F_S together. Essentially, we join HR and S on RID and complete the computation of the full inner products on the fly, and follow that up by applying both F_e() and G() on each example. By aggregating both these quantities as it performs the join, FL completes the computation of F = Σ F_e(y, w^T x) and ∇F_S = Σ G(y, w^T x) x_S. Simultaneously, FL also performs a GROUP BY on RID and sums up G(y, w^T x), which is saved in a new table HS under the SumScaledIP column (part 2 in Figure 3).
3. Compute ∇F_R = Σ G(y, w^T x) x_R by joining HS with R on RID and scaling the partial feature vectors x_R with SumScaledIP (part 3 in Figure 3).
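A minimal NumPy rendering of these three steps for logistic regression (our sketch; it assumes RIDs are 0..n_R−1 so that the GROUP BY on RID can be a simple bincount, and the function and variable names are ours):

```python
import numpy as np

def fl_loss_and_gradient(w_S, w_R, S_y, S_xs, S_fk, R_xr):
    """One BGD pass of factorized learning for LR, without ever forming T."""
    # Step 1: partial inner products over R -> HR(RID, PartialIP).
    partial_ip = R_xr @ w_R                                   # shape (nR,)

    # Step 2: join HR with S on RID, finish the inner products on the fly, and
    # aggregate F, grad F_S, and the per-RID sums HS(RID, SumScaledIP).
    b = S_xs @ w_S + partial_ip[S_fk]                         # full w^T x per example
    F = np.log1p(np.exp(-S_y * b)).sum()                      # F_e for LR
    g = -S_y / (1.0 + np.exp(S_y * b))                        # G for LR, one scalar per example
    grad_S = S_xs.T @ g                                       # sum_i G_i * x_S,i
    sum_scaled_ip = np.bincount(S_fk, weights=g, minlength=R_xr.shape[0])

    # Step 3: join HS with R on RID and scale x_R by SumScaledIP.
    grad_R = R_xr.T @ sum_scaled_ip                           # sum_rid HS[rid] * x_R[rid]
    return F, np.concatenate([grad_S, grad_R])

# Tiny usage with random data (shapes follow Figure 1(A)):
rng = np.random.default_rng(0)
nS, nR, dS, dR = 1000, 50, 3, 4
S_y, S_xs = rng.choice([-1.0, 1.0], nS), rng.normal(size=(nS, dS))
S_fk, R_xr = rng.integers(0, nR, nS), rng.normal(size=(nR, dR))
F, gradF = fl_loss_and_gradient(np.zeros(dS), np.zeros(dR), S_y, S_xs, S_fk, R_xr)
```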
Example: Consider logistic regression (LR). In step 2, as the full inner product w^T x is computed by joining HR and
