Clay: Fine-Grained Adaptive Partitioning
for General Database Schemas
Marco Serafini, Rebecca Taft, Aaron J. Elmore, Andrew Pavlo, Ashraf Aboulnaga, Michael Stonebraker
Qatar Computing Research Institute - HBKU, Massachusetts Institute of Technology, University of Chicago, Carnegie Mellon University
mserafini@qf.org.qa, rytaft@mit.edu, aelmore@cs.uchicago.edu, pavlo@cs.cmu.edu, aaboulnaga@qf.org.qa, stonebraker@csail.mit.edu
ABSTRACT
Transaction processing database management systems (DBMSs)
are critical for today’s data-intensive applications because they en-
able an organization to quickly ingest and query new information.
Many of these applications exceed the capabilities of a single server,
and thus their database has to be deployed in a distributed DBMS.
The key factor affecting such a system’s performance is how the
database is partitioned. If the database is partitioned incorrectly, the
number of distributed transactions can be high. These transactions
have to synchronize their operations over the network, which is
considerably slower and leads to poor performance. Previous work
on elastic database repartitioning has focused on a certain class of
applications whose database schema can be represented in a hierar-
chical tree structure. But many applications cannot be partitioned
in this manner, and thus are subject to distributed transactions that
impede their performance and scalability.
In this paper, we present a new on-line partitioning approach,
called Clay, that supports both tree-based schemas and more com-
plex “general” schemas with arbitrary foreign key relationships.
Clay dynamically creates blocks of tuples to migrate among servers
during repartitioning, placing no constraints on the schema but tak-
ing care to balance load and reduce the amount of data migrated.
Clay achieves this goal by including in each block a set of hot tuples
and other tuples co-accessed with these hot tuples. To evaluate our
approach, we integrate Clay in a distributed, main-memory DBMS
and show that it can generate partitioning schemes that enable the
system to achieve up to 15× better throughput and 99% lower la-
tency than existing approaches.
1. INTRODUCTION
Shared-nothing, distributed DBMSs are the core component for
modern on-line transaction processing (OLTP) applications in many
diverse domains. These systems partition the database across mul-
tiple nodes (i.e., servers) and route transactions to the appropriate
nodes based on the data that these transactions touch. The key to
achieving good performance is to use a partitioning scheme (i.e., a
mapping of tuples to nodes) that (1) balances load and (2) avoids
expensive multi-node transactions [5, 23]. Since the load on the
DBMS fluctuates, it is desirable to have an elastic system that auto-
matically changes the database’s partitioning and number of nodes
dynamically depending on load intensity and without having to stop
the system.
The ability to change the partitioning scheme without disrupt-
ing the database is important because OLTP systems incur fluctu-
ating loads. Additionally, many workloads are seasonal or diurnal,
while other applications are subject to dynamic fluctuations in their
workload. For example, the trading volume on the NYSE is an
order of magnitude higher at the beginning and end of the trading
day, and transaction volume spikes when there is relevant breaking
news. Further complicating this problem is the presence of hotspots
that can change over time. These occur because the access pattern
of transactions in the application’s workload is skewed such that
a small portion of the database receives most of the activity. For
example, half of the NYSE trades are on just 1% of the securities.
One could deal with these fluctuations by provisioning for ex-
pected peak load. But this requires deploying a cluster that is over-
provisioned by at least an order of magnitude [27]. Furthermore, if
the performance bottleneck is due to distributed transactions caus-
ing nodes to wait for other nodes, then adding servers will be of
little or no benefit. Thus, over-provisioning is not a good alterna-
tive to effective on-line reconfiguration.
Previous work has developed techniques to automate DBMS re-
configuration for unpredictable OLTP workloads. For example,
Accordion [26], ElasTras [6], and E-Store [28] all study this prob-
lem. These systems assume that the database is partitioned a pri-
ori into a set of static blocks, and all tuples of a block are moved
together at once. This does not work well if transactions access
tuples in multiple blocks and these blocks are not colocated on the
same server. One study showed that a DBMS’s throughput drops
by half from its peak performance with only 10% of transactions
distributed [23]. This implies that minimizing distributed transac-
tions is just as important as balancing load when finding an optimal
partitioning plan. To achieve this goal, blocks should be defined
dynamically so that tuples that are frequently accessed together are
grouped in the same block; co-accesses within a block never gener-
ate distributed transactions, regardless of where blocks are placed.
Another problem with the prior approaches is that they only work
for tree schemas. This excludes many applications with schemas
that cannot be transposed into a tree and where defining static blocks
is impossible. For example, consider the Products-Parts-Suppliers
schema shown in Figure 1. This schema contains three tables that
have many-to-many relationships between them. A product uses
many parts, and a supplier sells many parts. If we apply prior ap-
proaches and assume that either Products or Suppliers is the root

[Figure 1 diagram: Products, Parts, and Suppliers tables connected through the join tables PartUsage (prodId, partId) and PartSales (partId, suppId).]
Figure 1: Products-Parts-Suppliers database schema. Arrows represent
child-parent foreign key relationships.
of a tree, we get an inferior data placement. If we assume Products
is the root, then we will colocate Parts and Suppliers tuples with
their corresponding Products tuples. But this placement is poor because
Parts are shared across multiple Products, and a Supplier may
supply many Parts; the case where Suppliers is the root is symmetric. Hence, there is no good partitioning scheme
that can be identified by solely looking at the database schema, and
a more general approach is required for such “bird’s nest” schemas.
There is also previous work on off-line database partitioning for
general (i.e., non-tree) schemas with the goal of minimizing dis-
tributed transactions. Schism is a prominent representative of this
line of work [5]. The basic idea is to model the database as a graph
where each vertex represents a tuple, and an edge connects two ver-
tices if their tuples are accessed together in a transaction. An edge’s
weight corresponds to the number of transactions accessing the two
tuples together. Partitions are defined using a MinCut algorithm to
split the graph in a way that minimizes the weight on inter-partition
edges (such edges represent distributed transactions). Schism is an
off-line approach, which means that it needs to be re-run each time
a reconfiguration is required. As we will explain later, the dual
goals of balancing load and minimizing distributed transactions are
difficult to express in a MinCut problem formulation. Furthermore,
Schism does not take into account the current database configura-
tion, and thus it cannot minimize data movement.
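To make the graph model concrete, here is a minimal Python sketch (not Schism's implementation; the trace format and function names are illustrative) that builds a co-access graph from a transaction trace and measures the weight of the edges cut by a candidate tuple-to-partition assignment, which is the quantity a MinCut-style partitioner tries to minimize.

# Illustrative sketch: co-access graph and cut weight for a candidate partitioning.
from collections import defaultdict
from itertools import combinations

def build_access_graph(transactions):
    """Each transaction is a set of tuple ids; edge weights count co-accesses."""
    edges = defaultdict(int)
    for txn in transactions:
        for u, v in combinations(sorted(txn), 2):
            edges[(u, v)] += 1
    return edges

def cut_weight(edges, plan):
    """Total weight of edges whose endpoints land on different partitions."""
    return sum(w for (u, v), w in edges.items() if plan[u] != plan[v])

txns = [{"prod:1", "part:7"}, {"part:7", "supp:3"}, {"prod:1", "part:7"}]
plan = {"prod:1": 0, "part:7": 0, "supp:3": 1}
print(cut_weight(build_access_graph(txns), plan))  # -> 1 (the part:7/supp:3 co-access)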
To overcome all of the above limitations, we present Clay, an
elasticity algorithm that makes no assumptions about the schema,
and is able to simultaneously balance load and minimize distributed
transactions. Unlike existing work on on-line reconfiguration, which
migrates tuples in static blocks, Clay uses dynamic blocks, called
clumps, that are created on-the-fly by monitoring the workload when
a reconfiguration is required. The formation of a clump starts from
a hot tuple that the DBMS wants to migrate away from an over-
loaded partition. After identifying such a hot tuple, Clay enlarges
the clump around that tuple by adding its frequently co-accessed
tuples. This avoids generating a large number of distributed trans-
actions when moving the clump. Another advantage of Clay is that
it is incremental, thereby minimizing the cost of data migration. In
our experiments, Clay outperforms another on-line approach based
on the Metis graph partitioning algorithm [16] by 1.7–15× in terms
of throughput and reduces latency by 41–99%. Overall, the perfor-
mance of Clay depends on how skewed the workload is: the higher
the skew, the better the gain to be expected by using Clay.
2. OVERVIEW
We first illustrate the main idea of our clump migration technique
using the example of the Products-Parts-Suppliers database from
Figure 1. For simplicity, we examine an instance running on three
servers, each with one partition. Assume that partition P3 becomes
overloaded because it hosts too many hot tuples. When overload is
detected, Clay monitors all transactions executed in the system for
a few seconds. Based on this sample, it builds a heat graph like
the one depicted in Figure 2, where vertices are tuples and edges
represent co-accesses among tuples. The heat graph includes only
6"
1"
4"
Products)
Parts)
Suppliers)
Cold)
Hot)
P1)
P3)
Clump""
To"par..on"P2"
2"
3"
5"
7"
8"
Co-access)
P2)
Warm)
Figure 2: Heat graph example for a Products-Parts-Suppliers database in
partitions P1, P2, and P3. For simplicity, we only consider three degrees
of hotness. P3 is initially overloaded. Clay creates a clump and moves it to
P2. Vertex IDs indicate the order that Clay adds them to the clump.
the tuples whose activity has been observed during the monitoring
interval. Some tuples and edges may be hotter than others. This is
modeled using vertex and edge weights.
Clay builds clumps based on the heat graph. Initially, it creates a
clump consisting of the hottest tuple of the most overloaded partition, namely the Suppliers tuple corresponding to vertex #1 in Figure 2. It
then evaluates the effect of moving the clump to another partition.
To minimize distributed transactions, Clay looks for the partition
whose tuples are most frequently accessed with the tuples in the
clump, which is partition P2 in our example. Moving the clump (consisting of only vertex #1
at this point) would generate too many distributed transactions because
there is a large number of edges between partitions P2 and P3. As
a result, P2 would become overloaded, and P3 would have no ben-
efit from the move due to an increased number of distributed trans-
actions. Therefore, Clay extends the clump with the vertex that is
most frequently co-accessed with the clump, which is vertex #2 in
the example. The process repeats, and the clump is extended with
vertices #3–8. Note that vertices #4 and #6–8 are not co-accessed
with the initial tuple, but are still added to the clump due to the tran-
sitivity of the co-access relation. Note also that vertices #5–8 reside
on a different partition from the initial tuple. Clay ignores the cur-
rent partitioning when building a clump, focusing exclusively on
the co-access patterns and adding affine tuples from any partition.
The process continues until Clay finds a clump that can be moved to
a partition without overloading it. If the clump cannot be extended
anymore or it reaches a maximum size, Clay scales out the system
by adding a new partition and restarts the clump-finding process.
To build the heat graph, it is necessary to collect detailed in-
formation about co-accesses among tuples in the same transaction.
Clay performs this on-line monitoring efficiently and only for a
short interval of time (20 seconds). Although the heat graph can
become large, with up to billions of vertices and edges in our ex-
periments, it is still small enough to fit in main memory; our recon-
figuration algorithm always used less than 4 GB.
3. RELATED WORK
A significant amount of research exists on partitioning strategies
for analytic workloads (OLAP), typically balancing locality with
declustering data to maximize parallelism [20, 33]. Some of that
work explicitly considers on-line partitioning algorithms for analyt-
ics [14] or large graphs [32]. We limit our discussion to partition-

ing of OLTP databases, since the goals and techniques are different
from partitioning for OLAP applications. Most notably, these ap-
proaches do not combine scaling to tuple-level granularity, mixing
load-balancing with minimizing cross-partition transactions, and
building incremental solutions to update the partitioning.
As discussed earlier, Schism [5] is an off-line algorithm that an-
alyzes a transaction log and has no performance monitoring or live
reconfiguration components. It builds an access graph similar to
our heat graph and uses Metis [16] to find a partitioning that min-
imizes the edge cuts. But since Metis cannot support large graphs,
the DBA must pre-process the traces by sampling transactions and
tuples, filtering by access frequency, and aggregating tuples that are
always accessed together in a single vertex. Since keeping an ex-
plicit mapping of every tuple to a partition would result in a huge
routing table, Schism creates a decision tree that simplifies the in-
dividual mapping of tuples into a set of range partitions. Finally, in
the final validation step, Schism compares different solutions ob-
tained in the previous steps and selects the one having the lowest
rate of distributed transactions. Clay’s clump migration heuristic
is incremental, so it minimizes data migration, and it outperforms
Metis when applied to the heat graph. In addition, Clay’s two-tiered
routing creates small sets of hot tuples that minimize the size of the
routing tables, so it does not require Schism’s extra steps.
Sword [25] is another off-line partitioning tool that models the
database as a hypergraph and uses an incremental heuristic to ap-
proximate constrained n-way graph partitioning. It uses a one-
tier routing scheme that divides the database into coarse-grained
chunks. Sword performs incremental partitioning adjustments by
periodically evaluating the effect of swapping pairs of chunks. Our
experiments show that Clay outperforms state-of-the-art algorithms
that compute constrained n-way graph partitioning from scratch.
Furthermore, Clay adopts a two-tiered approach that supports fine-
grained mapping for single tuples.
Like Schism and Sword, JECB [30] provides a partitioning strat-
egy to handle complex schemas, but the focus is on scalable par-
titioning for large clusters. JECB examines a workload, database
schema, and source code to derive a new partitioning plan using a
divide-and-conquer strategy. The work does not explicitly consider
hot and cold partitions (or tuples) that arise from workload skew.
PLP is a technique to address partitioning in a single-server,
shared-memory system to minimize bottlenecks that arise from con-
tention [22]. The approach recursively splits a tree amongst dedi-
cated executors. PLP focuses on workload skew, and does not ex-
plicitly consider co-accesses between tuples or scaling out across
multiple machines. ATraPos improves on PLP by minimizing ac-
cesses to centralized data structures [24]. It considers a certain
number of sub-partitions (similar to algorithms using static blocks)
and assigns them to processor sockets in a way that balances load
and minimizes the inter-process synchronization overhead.
None of the aforementioned papers discuss elasticity (i.e., adding
and removing nodes), but there are several systems that enable
elastic scaling through limiting the scope of transactions. Mega-
store [2] uses entity groups to identify a set of tuples that are se-
mantically related, and limits multi-object transactions to within the
group. Others have presented a technique to identify entity groups
given a schema and workload trace [17]. This approach is similar
to Clay in that it greedily builds sets of related items, but it focuses
on breaking a schema into groups, and load-balancing and tuple-to-
partition mapping are not factors in the grouping. Similarly, Hor-
ticulture [23] identifies the ideal attributes to partition each table
but does not address related tuple placement. Beyond small entity
groups, ElasTras [6], NuoDB [21], and Microsoft’s cloud-based
SQL Server [3] achieve elastic scaling on complex structures by
limiting transactions to a single partition. Although ElasTras does
support elastic scaling, the system does not specify how to split
and merge partitions to balance workload skew and tuple affinity.
Many key-value stores support intelligent data placement for load-
balancing and elastic scaling [31, 11, 9], but provide weaker trans-
action guarantees than a relational DBMS.
Accordion [26] provides coarse-grained elastic partitioning: the
database is pre-partitioned into a relatively small number of data
chunks (or virtual partitions), each potentially comprising a large
number of tuples. The number of chunks must stay small so that
Accordion’s Mixed Integer Linear Programming (MILP) solver
can find an optimal plan. The problem with a coarse-grained ap-
proach is that it cannot deal with skewed workloads where multiple
hot tuples may be concentrated in one data chunk [28]. Accordion
learns the shape of the capacity function for each configuration.
With few chunks there are only relatively few configurations, but if
we consider each tuple as a potential chunk, then it becomes impos-
sible to build an accurate capacity model for every configuration.
Coarse-grained approaches have problems with skewed work-
loads where multiple hot tuples can end up in the same chunk. E-
Store [28] supports tree-schemas by using a two-tiered approach for
load-balancing. It uses fine-grained partitioning for a small number
of hot tuples, and a coarse-grained partitioning for the rest of the
database. Targeting hot tuples in this manner allows the system to
identify hot spots, but it has limitations. Consider the case where
two hot tuples are frequently accessed together in a transaction. E-
Store ignores such co-accesses, so it can independently place hot
tuples on different servers, thereby generating a large number of
distributed transactions. To avoid this problem, E-Store must as-
sume that the database schema is tree-structured and every transac-
tion accesses only the tree of one root tuple. Hence, a root tuple and
its descendants are moved as a unit. Lastly, E-Store fails to address
co-accesses to hot dependent tuples.
Online workload monitoring has been used to deal with hot keys
also in stream processing systems [18, 19] and in general sharding
systems like Slicer [1]. However, these systems have no notion of
transactions or co-accesses.
4. PROBLEM STATEMENT
We now define the data placement problem that Clay seeks to
solve. A database consists of a set of tables ⟨T_1, . . . , T_t⟩. Each table T_i has one or more partitioning attributes ⟨A^i_1, . . . , A^i_h⟩, which are a subset of the total set of attributes of T_i. Tuples are horizontally partitioned across a set of servers s_1, . . . , s_j. All tuples of table T_i with the same values of their partitioning attributes ⟨A^i_1 = x_1, . . . , A^i_h = x_h⟩ are placed on the same server and are modeled as a vertex. The database sample is represented as a set of vertices V = {v_1, . . . , v_n}, where each vertex v has a weight w(v) denoting how frequently the vertex is accessed. Co-accesses between two vertices are modeled as an edge, whose weight denotes the frequency of the co-accesses. We call the resulting graph G(V, E), having vertices in V and edges in E, the heat graph.
Data placement is driven by a partitioning plan P : V → Π that maps each vertex to a partition in Π based on the value of its partitioning attributes. A single partition can correspond to a server, or multiple partitions can be statically mapped onto a single server.
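To make these definitions concrete, a minimal sketch (hypothetical names, not Clay's code): a heat-graph vertex can be keyed by the table name together with the values of its partitioning attributes, and a partitioning plan P is then a mapping from such vertex keys to partitions.

# A heat-graph vertex groups all tuples of a table that share the same
# partitioning-attribute values; the plan P maps vertex keys to partitions.
def vertex_key(table, part_attr_values):
    return (table, tuple(part_attr_values))

plan = {
    vertex_key("Suppliers", [42]): "P3",  # Suppliers tuples with suppId = 42
    vertex_key("Parts", [7]): "P1",       # Parts tuples with partId = 7
}
print(plan[vertex_key("Suppliers", [42])])  # -> P3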
4.1 Incremental Data Placement Problem
Clay solves an incremental data placement problem that can be
formulated as follows. The system starts from an initial plan P .
Let L_P(p) be the load of partition p in the plan P, let ε be the percentage of load imbalance that we allow in the system, and let θ be the average load across all partitions in the plan P multiplied by 1 + ε. Let P be the current partitioning plan, P′ be the next partitioning plan identified by Clay, and ∆(P, P′) be the number of vertices mapped to a different partition in P and P′. Given this, the system seeks to minimize the following objective function:

$\text{minimize } \langle |P'|,\ \Delta(P, P') \rangle$    (1)
$\text{s.t. } \forall p \in \Pi : L_{P'}(p) < \theta$

We specify the two objectives in order of priority. First, we minimize the number of partitions in P′. Second, we minimize the amount of data movement among the solutions with the same number of partitions. In either case, we limit the load imbalance to be at most ε. We define w as the weight of a vertex or edge, E as the set of edges in the heat graph, and k > 0 as a constant that indicates the cost of multi-partition tuple accesses, which require additional coordination. Given this, the load of a partition p ∈ Π in a partitioning plan P is expressed as follows:

$L_P(p) = \sum_{v \in V,\ P(v) = p} w(v) \;+\; \sum_{\langle v,u \rangle \in E,\ P(v) = p,\ P(u) \neq p} w(\langle v,u \rangle) \cdot k$    (2)
The parameter k indicates how much to prioritize solutions that
minimize distributed transactions over ones that balance tuple ac-
cesses. Increasing k gives greater weight to the number of dis-
tributed transactions in the determination of the load of a partition.
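The load function of Equation 2 is straightforward to compute from the heat graph and a plan. Below is a minimal Python sketch (an illustration over assumed dictionary-based data structures, not the system's code); k defaults to 50 here only because Section 7.1 reports that value for the experiments.

# Sketch of Equation 2: the load of partition p under plan P.
# vertex_weight: {v: w(v)}; edge_weight: {(v, u): w(v, u)}; plan: {v: partition}.
def partition_load(p, plan, vertex_weight, edge_weight, k=50):
    local = sum(w for v, w in vertex_weight.items() if plan[v] == p)
    cross = sum(w for (v, u), w in edge_weight.items()
                if (plan[v] == p) != (plan[u] == p))  # edges incident on p that cross it
    return local + k * cross

A partition is then flagged as overloaded when partition_load exceeds the threshold θ derived from the average load and ε as defined above.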
4.2 Comparison with Graph Partitioning
We now revisit the issue of comparing Clay with graph partition-
ing techniques, and in particular to the common variant solved by
Metis [16] in Schism [5]. The incremental data placement prob-
lem is different from constrained n-way graph partitioning on the
heat graph, where n is the number of database partitions. The first
distinction is incrementality, since a graph partitioner ignores the
previous plan P and produces a new plan P′ from scratch. By computing a new P′, the DBMS may have to shuffle data to transition from P to P′, which will degrade its performance. We contend,
however, that the difference is not limited to incrementality.
Graph partitioning produces a plan P
that minimizes the num-
ber of edges across partitions under the constraint of a maximum
load imbalance among partitions. The load of a partition is ex-
pressed as the sum of the weights of the vertices in the partition:
$\hat{L}_P(p) = \sum_{v \in V,\ P(v) = p} w(v)$    (3)

To be more precise, consider the Metis graph partitioner that solves the following problem:

$\text{minimize } |\{\langle v,u \rangle \in E : P'(v) \neq P'(u)\}|$    (4)
$\text{s.t. } \forall p, q \in \Pi : \hat{L}_{P'}(p) / \hat{L}_{P'}(q) < 1 + \eta$

where η is an imbalance constraint provided by the user.
The load balancing constraint is over a load function, L̂_P′, which
does not take into account the cost of distributed transactions. In
graph terms, the function does not take into account the load caused
by cross-partition edges. This is in contrast with the definitions of
Equations 1 and 2, where the load threshold considers at the same
time both local and remote tuple accesses and their respective cost.
The formulation of Equations 1 and 2 has two advantages over
Equations 3 and 4. The first is that the former constrains the num-
ber of cross-partition edges per partition, whereas Equation 4 min-
imizes the total number of cross-partition edges. Therefore, Equa-
tion 4 could create a “star” partitioning setting where all cross-
partition edges, and thus all distributed transactions, are incident
on a single server, causing that server to be highly overloaded.
!"#$%&'()*"+(,-.+/*$+01*%&-
2345-!6/*%7-8%9,9:-;'!*)$%<
4$#(/#=*+)(-
>)(+*)$-#(&-
;+,"'3%?%@-
!6/*%7->)(+*)$
A@17B-
>+,$#*+)(
Current'partition'plan
Transactional'access'trace
C%=)(D+,1$#*+)(-E(,+(%-
8%9,9:-!F1#@@<
Reconfiguration'plan
Intercept'SQL'query'
processing
Figure 3: The Clay framework.
The second advantage of our formulation is that it combines the
two conflicting goals of reaching balanced tuple accesses and min-
imizing distributed transactions using a single load function. Con-
sidering the two goals as separate makes it difficult to find a good
level of η, as our experiments show. In fact, if the threshold is
too low, Metis creates a balanced load in terms of single partition
transactions, but it also causes many distributed transactions. If the
threshold is too large, Metis causes fewer distributed transactions,
but then the load is not necessarily balanced.
One could consider using L_P instead of L̂_P as the expression of
load in Equation 4. Unfortunately, this is not possible since vertex
weights need to be provided as an input to the graph partitioner,
whereas the number of cross-partition edges depends on the out-
come of the partitioning itself.
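For contrast, the quantities used in Equations 3 and 4 can be sketched with the same assumed data structures as above (again an illustration, not Metis itself); note that the balance constraint here ignores cross-partition edges, which is the crux of the difference discussed in this section.

# Sketch of Equations 3-4: objective and constraint of a Metis-style partitioner.
def balanced_load(p, plan, vertex_weight):
    # Equation 3: vertex weights only; cross-partition edges are not counted
    return sum(w for v, w in vertex_weight.items() if plan[v] == p)

def total_edge_cut(plan, edges):
    # Equation 4 objective: edges whose endpoints land on different partitions
    return sum(1 for (v, u) in edges if plan[v] != plan[u])

def balance_ok(plan, vertex_weight, partitions, eta):
    # Equation 4 constraint: pairwise load ratios stay below 1 + eta
    loads = [balanced_load(p, plan, vertex_weight) for p in partitions]
    return min(loads) > 0 and max(loads) / min(loads) < 1 + eta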
5. SYSTEM ARCHITECTURE
Clay runs on top of a distributed OLTP DBMS and a reconfig-
uration engine that can dynamically change the data layout of the
database (see Figure 3). The monitoring component of Clay is acti-
vated whenever performance objectives are not met (e.g., when the
latency of the system does not meet an SLA). If these conditions
occur, Clay starts a transaction monitor that collects detailed work-
load monitoring information (Section 6). This information is sent
to a centralized reconfiguration controller that builds a heat graph.
The controller then runs our migration algorithm that builds clumps
on the fly and determines how to migrate them (Section 7).
Although Clay’s mechanisms are generic, the implementation
that we use in our evaluation is based on the H-Store system [12,
15]. H-Store is a distributed, in-memory DBMS that is optimized
for OLTP workloads and assumes that most transactions are short-
lived and datasets are easily partitioned. The original H-Store de-
sign supports a static configuration where the set of partitions and
hosts and the mapping between tuples and partitions are all fixed.
The E-Store [28] system relaxes some of these restrictions by al-
lowing for a dynamic number of partitions and nodes. E-Store also
changes how tuples are mapped to partitions by using a two-tiered
partitioning scheme that uses fine-grained partitioning (e.g., range
partitioning) for a set of “hot” tuples and then a simple scheme
(e.g., range partitioning of large chunks or hash partitioning) for
large blocks of “cold” tuples. Clay uses this same two-tier partition-
ing scheme in H-Store. It also uses Squall for reconfiguration [8],
although Clay's techniques are agnostic to the specific reconfiguration engine used.
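A two-tiered lookup of this kind can be sketched as follows (a hypothetical illustration, not H-Store's or E-Store's actual routing code): an explicit map holds the small set of hot tuples, and a default scheme (hash partitioning here) covers the cold remainder.

# Sketch of two-tiered tuple-to-partition routing: hot-tuple exceptions first,
# then a coarse-grained default (hash partitioning here) for cold tuples.
def route(table, part_key, hot_map, num_partitions):
    exception = hot_map.get((table, part_key))
    if exception is not None:
        return exception                                  # fine-grained tier: explicit placement
    return hash((table, part_key)) % num_partitions       # cold tier: default scheme

hot_map = {("Suppliers", 42): 2}                          # hot supplier pinned to partition 2
print(route("Suppliers", 42, hot_map, 3))                 # -> 2
print(route("Parts", 7, hot_map, 3))                      # -> hash-based default partition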
6. TRANSACTION MONITORING
The data placement problem of Section 4 models a database as a
weighted graph. The monitoring component collects the necessary
information to build the graph: it counts the number of accesses to
tuples (vertices) and the co-accesses (edges) among tuples.

Monitoring tracks tuple accesses by hooking onto the transaction
routing module. When processing a transaction, H-Store breaks
SQL statements into smaller fragments that execute low-level op-
erations. It then routes these fragments to one or more partitions
based on the values of the partitioning attributes of the tuples that
are accessed by the fragment.
Clay performs monitoring by adding hooks in the DBMS’s query
processing components that extract the values of the partitioning
attributes used to route the fragments. These values correspond to
specific vertices of the graph, as discussed in Section 4. The mon-
itoring component is executed by each server and writes tuple ac-
cesses onto a monitoring file using the format ⟨tid, T, x_1, . . . , x_h⟩, where tid is a unique id associated with the transaction performing the access, T is the table containing the accessed tuple, h is the number of partitioning attributes of table T, and x_i is the value of the i-th partitioning attribute in the accessed tuple. When a transaction is completed, monitoring adds an entry ⟨END, tid⟩.
Query-level monitoring captures more detailed information than
related approaches. It is able to determine not only which tuples are
accessed, but also which tuples are accessed together by the same
transaction. E-Store restricted monitoring to root tuples because of
the high cost of using low-level operations to track access patterns
for single tuples [28]. Our evaluation shows that Clay’s monitoring
is more accurate and has low overhead.
Clay disables monitoring during normal transaction execution
and only turns it on when some application-specific objectives are
violated (e.g., if the 99th percentile latency exceeds a pre-defined
target). After being turned on, monitoring remains active for a short
time. Our experiments established that 20 seconds was sufficient to
detect frequently-accessed hot tuples.
Once a server terminates monitoring, it sends the collected data
to a centralized controller that builds the heat graph (see Figure 2)
and computes the new plan. For every access to a tuple/vertex v
found in the monitoring data, the controller increments v's weight
by one divided by the length of the monitoring interval for the
server, to reflect the rate of transaction accesses. Vertices accessed
by the same transactions are connected by an edge whose weight is
computed similarly to a vertex weight.
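Putting the record format and the weighting rule together, the controller's graph construction can be sketched as follows (illustrative Python over an assumed in-memory list of records rather than the actual monitoring files).

# Sketch: build the heat graph from monitoring records.
# records: iterable of ("ACCESS", tid, vertex) and ("END", tid) tuples;
# vertex is (table, partitioning-attribute values), as in Section 4.
from collections import defaultdict
from itertools import combinations

def build_heat_graph(records, interval_seconds):
    per_txn = defaultdict(set)
    vertex_w = defaultdict(float)
    edge_w = defaultdict(float)
    rate = 1.0 / interval_seconds                 # each access counts 1/interval
    for rec in records:
        if rec[0] == "ACCESS":
            _, tid, vertex = rec
            vertex_w[vertex] += rate
            per_txn[tid].add(vertex)
        else:                                     # ("END", tid): transaction finished
            txn_vertices = per_txn.pop(rec[1], set())
            for u, v in combinations(sorted(txn_vertices), 2):
                edge_w[(u, v)] += rate            # co-access edge weight
    return vertex_w, edge_w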
7. CLUMP MIGRATION
The clump migration algorithm is the central component of Clay.
It takes as input the current partitioning plan P , which maps tu-
ples to partitions, and the heat graph G produced by the monitoring
component. Its output is a new partitioning plan (see Section 4).
We now describe the algorithm in more detail.
7.1 Dealing with Overloaded Partitions
The clump migration algorithm of Clay starts by identifying the
set of “overloaded” partitions that have a load higher than a thresh-
old θ. The load per partition is defined by the formula in Equation 2
(we used a value of k = 50 in all our experiments since we found
that distributed transactions impact performance much more than
local tuple accesses). For each overloaded partition P_o, the migration algorithm dynamically defines and migrates clumps until the load of P_o is below the θ threshold (see Algorithm 1). A clump created to offload a partition P_o will contain some tuples of P_o but it can also contain tuples from other partitions. A move is a pair consisting of a clump and a destination partition.
Initializing a clump. The algorithm starts with an empty clump M. It then sets M to be the hottest vertex h in the hot tuples list H(P_o), which contains the most frequently accessed vertices for each partition (i.e., those having the highest weight, in descending order of access frequency).
Algorithm 1: Migration algorithm to offload partition P_o
  look-ahead ← A;
  while L(P_o) > θ do
    if M = ∅ then
      // initialize the clump
      h ← next hot tuple in H(P_o);
      M ← {h};
      d ← initial-partition(M);
    else if some vertex in M has neighbors then
      // expand the clump
      M ← M ∪ most-co-accessed-neighbor(M, G);
      d ← update-dest(M, d);
    else
      // cannot expand the clump anymore
      if C ≠ ∅ then
        move C.M to C.d;
        M ← ∅;
        look-ahead ← A;
      else
        add a new server and restart the algorithm;
    // examine the new clump
    if feasible(M, d) then
      C.M ← M;
      C.d ← d;
    else if C ≠ ∅ then
      look-ahead ← look-ahead − 1;
      if look-ahead = 0 then
        move C.M to C.d;
        M ← ∅;
        look-ahead ← A;
Algorithm 2: Finding the best destination for M
  function update-dest(M, d)
    if ¬feasible(M, d) then
      a ← partition most frequently accessed with M;
      if a ≠ d ∧ feasible(M, a) then
        return a;
      l ← least loaded partition;
      if l ≠ d ∧ ∆_r(M, a) < ∆_r(M, l) ∧ feasible(M, l) then
        return l;
    return d;
The algorithm then picks the destination partition that minimizes the overall load of the system. The function initial-partition selects the destination partition d having the lowest receiver delta ∆_r(M, d), where ∆_r(M, d) is defined as the load of d after receiving M minus the load of d before receiving M. Given the way
the load function is defined (see Equation 2), the partition with the
lowest receiver delta is the one whose tuples are most frequently
co-accessed with the tuples in M , so moving M to that partition
minimizes the number of distributed transactions. The initial selec-
tion of d prioritizes partitions that do not become overloaded after
the move, if available. Among partitions with the same receiver
delta, the heuristic selects the one with the lowest overall load. In
systems like H-Store that run multiple partitions on the same physi-
cal server, the cost function assigns a lower cost to transactions that
access partitions in the same server than to distributed transactions.
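Using the load function of Equation 2 (the partition_load sketch in Section 4.1), the receiver delta and the initial destination choice can be sketched as follows; this is a simplified illustration with one partition per server and hypothetical helper names, not the authors' implementation.

# Sketch: receiver delta and initial destination selection for a clump M.
def receiver_delta(clump, dest, plan, vertex_weight, edge_weight, k=50):
    """Load of dest after receiving the clump minus its load before."""
    before = partition_load(dest, plan, vertex_weight, edge_weight, k)
    moved = {**plan, **{v: dest for v in clump}}   # tentative plan with M moved to dest
    after = partition_load(dest, moved, vertex_weight, edge_weight, k)
    return after - before

def initial_partition(clump, partitions, plan, vertex_weight, edge_weight, theta, k=50):
    """Prefer destinations that stay below theta, then lowest receiver delta, then lowest load."""
    def key(d):
        delta = receiver_delta(clump, d, plan, vertex_weight, edge_weight, k)
        before = partition_load(d, plan, vertex_weight, edge_weight, k)
        overloaded = before + delta >= theta
        return (overloaded, delta, before)
    return min(partitions, key=key)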
Expanding a clump. If M is not empty, it is extended with the
neighboring tuple of M that is most frequently co-accessed with
a tuple in M. This is found by iterating over all the neighbors of
