Clay: Fine-Grained Adaptive Partitioning
for General Database Schemas
Marco Serafini, Rebecca Taft, Aaron J. Elmore, Andrew Pavlo, Ashraf Aboulnaga, Michael Stonebraker
Qatar Computing Research Institute - HBKU, Massachusetts Institute of Technology, University of Chicago, Carnegie Mellon University
mserafini@qf.org.qa, rytaft@mit.edu, aelmore@cs.uchicago.edu, pavlo@cs.cmu.edu, aaboulnaga@qf.org.qa, stonebraker@csail.mit.edu
ABSTRACT
Transaction processing database management systems (DBMSs)
are critical for today’s data-intensive applications because they en-
able an organization to quickly ingest and query new information.
Many of these applications exceed the capabilities of a single server,
and thus their database has to be deployed in a distributed DBMS.
The key factor affecting such a system’s performance is how the
database is partitioned. If the database is partitioned incorrectly, the
number of distributed transactions can be high. These transactions
have to synchronize their operations over the network, which is
considerably slower and leads to poor performance. Previous work
on elastic database repartitioning has focused on a certain class of
applications whose database schema can be represented in a hierar-
chical tree structure. But many applications cannot be partitioned
in this manner, and thus are subject to distributed transactions that
impede their performance and scalability.
In this paper, we present a new on-line partitioning approach,
called Clay, that supports both tree-based schemas and more com-
plex “general” schemas with arbitrary foreign key relationships.
Clay dynamically creates blocks of tuples to migrate among servers
during repartitioning, placing no constraints on the schema but tak-
ing care to balance load and reduce the amount of data migrated.
Clay achieves this goal by including in each block a set of hot tuples
and other tuples co-accessed with these hot tuples. To evaluate our
approach, we integrate Clay in a distributed, main-memory DBMS
and show that it can generate partitioning schemes that enable the
system to achieve up to 15× better throughput and 99% lower la-
tency than existing approaches.
1. INTRODUCTION
Shared-nothing, distributed DBMSs are the core component for
modern on-line transaction processing (OLTP) applications in many
diverse domains. These systems partition the database across mul-
tiple nodes (i.e., servers) and route transactions to the appropriate
nodes based on the data that these transactions touch. The key to
achieving good performance is to use a partitioning scheme (i.e., a
mapping of tuples to nodes) that (1) balances load and (2) avoids
expensive multi-node transactions [5, 23]. Since the load on the
DBMS fluctuates, it is desirable to have an elastic system that auto-
matically changes the database’s partitioning and number of nodes
dynamically depending on load intensity and without having to stop
the system.
The ability to change the partitioning scheme without disrupt-
ing the database is important because OLTP systems incur fluctu-
ating loads. Additionally, many workloads are seasonal or diurnal,
while other applications are subject to dynamic fluctuations in their
workload. For example, the trading volume on the NYSE is an
order of magnitude higher at the beginning and end of the trading
day, and transaction volume spikes when there is relevant breaking
news. Further complicating this problem is the presence of hotspots
that can change over time. These occur because the access pattern
of transactions in the application’s workload is skewed such that
a small portion of the database receives most of the activity. For
example, half of the NYSE trades are on just 1% of the securities.
One could deal with these fluctuations by provisioning for ex-
pected peak load. But this requires deploying a cluster that is over-
provisioned by at least an order of magnitude [27]. Furthermore, if
the performance bottleneck is due to distributed transactions caus-
ing nodes to wait for other nodes, then adding servers will be of
little or no benefit. Thus, over-provisioning is not a good alterna-
tive to effective on-line reconfiguration.
Previous work has developed techniques to automate DBMS re-
configuration for unpredictable OLTP workloads. For example,
Accordion [26], ElasTras [6], and E-Store [28] all study this prob-
lem. These systems assume that the database is partitioned a pri-
ori into a set of static blocks, and all tuples of a block are moved
together at once. This does not work well if transactions access
tuples in multiple blocks and these blocks are not colocated on the
same server. One study showed that a DBMS’s throughput drops
by half from its peak performance with only 10% of transactions
distributed [23]. This implies that minimizing distributed transac-
tions is just as important as balancing load when finding an optimal
partitioning plan. To achieve this goal, blocks should be defined
dynamically so that tuples that are frequently accessed together are
grouped in the same block; co-accesses within a block never gener-
ate distributed transactions, regardless of where blocks are placed.
Another problem with the prior approaches is that they only work
for tree schemas. This excludes many applications with schemas
that cannot be transposed into a tree and where defining static blocks
is impossible. For example, consider the Products-Parts-Suppliers
schema shown in Figure 1. This schema contains three tables that
have many-to-many relationships between them. A product uses
many parts, and a supplier sells many parts. If we apply prior ap-
proaches and assume that either Products or Suppliers is the root

[Figure 1 diagram: Products, Parts, and Suppliers tables connected through the join tables PartUsage (prodId, partId) and PartSales (partId, suppId).]
Figure 1: Products-Parts-Suppliers database schema. Arrows represent
child-parent foreign key relationships.
of a tree, we get an inferior data placement. If we assume Products
is the root, then we will colocate Parts and Suppliers tuples with
their corresponding Products tuples. But this placement is poor because
Parts are shared across multiple Products, and a Supplier may
supply many Parts; the case where Suppliers is the root is symmetric. Hence, there is no good partitioning scheme
that can be identified by solely looking at the database schema, and
a more general approach is required for such “bird’s nest” schemas.
There is also previous work on off-line database partitioning for
general (i.e., non-tree) schemas with the goal of minimizing dis-
tributed transactions. Schism is a prominent representative of this
line of work [5]. The basic idea is to model the database as a graph
where each vertex represents a tuple, and an edge connects two ver-
tices if their tuples are accessed together in a transaction. An edge’s
weight corresponds to the number of transactions accessing the two
tuples together. Partitions are defined using a MinCut algorithm to
split the graph in a way that minimizes the weight on inter-partition
edges (such edges represent distributed transactions). Schism is an
off-line approach, which means that it needs to be re-run each time
a reconfiguration is required. As we will explain later, the dual
goals of balancing load and minimizing distributed transactions are
difficult to express in a MinCut problem formulation. Furthermore,
Schism does not take into account the current database configura-
tion, and thus it cannot minimize data movement.
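To make the graph model concrete, here is a minimal Python sketch (not Schism's implementation; the trace format and function names are illustrative) that builds a co-access graph from a transaction trace and measures the weight of the edges cut by a candidate tuple-to-partition assignment, which is the quantity a MinCut-style partitioner tries to minimize.

# Illustrative sketch: co-access graph and cut weight for a candidate partitioning.
from collections import defaultdict
from itertools import combinations

def build_access_graph(transactions):
    """Each transaction is a set of tuple ids; edge weights count co-accesses."""
    edges = defaultdict(int)
    for txn in transactions:
        for u, v in combinations(sorted(txn), 2):
            edges[(u, v)] += 1
    return edges

def cut_weight(edges, plan):
    """Total weight of edges whose endpoints land on different partitions."""
    return sum(w for (u, v), w in edges.items() if plan[u] != plan[v])

txns = [{"prod:1", "part:7"}, {"part:7", "supp:3"}, {"prod:1", "part:7"}]
plan = {"prod:1": 0, "part:7": 0, "supp:3": 1}
print(cut_weight(build_access_graph(txns), plan))  # -> 1 (the part:7/supp:3 co-access)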
To overcome all of the above limitations, we present Clay, an
elasticity algorithm that makes no assumptions about the schema,
and is able to simultaneously balance load and minimize distributed
transactions. Unlike existing work on on-line reconfiguration, which
migrates tuples in static blocks, Clay uses dynamic blocks, called
clumps, that are created on-the-fly by monitoring the workload when
a reconfiguration is required. The formation of a clump starts from
a hot tuple that the DBMS wants to migrate away from an over-
loaded partition. After identifying such a hot tuple, Clay enlarges
the clump around that tuple by adding its frequently co-accessed
tuples. This avoids generating a large number of distributed trans-
actions when moving the clump. Another advantage of Clay is that
it is incremental, thereby minimizing the cost of data migration. In
our experiments, Clay outperforms another on-line approach based
on the Metis graph partitioning algorithm [16] by 1.7–15× in terms
of throughput and reduces latency by 41–99%. Overall, the perfor-
mance of Clay depends on how skewed the workload is: the higher
the skew, the better the gain to be expected by using Clay.
2. OVERVIEW
We first illustrate the main idea of our clump migration technique
using the example of the Products-Parts-Suppliers database from
Figure 1. For simplicity, we examine an instance running on three
servers, each with one partition. Assume that partition P3 becomes
overloaded because it hosts too many hot tuples. When overload is
detected, Clay monitors all transactions executed in the system for
a few seconds. Based on this sample, it builds a heat graph like
the one depicted in Figure 2, where vertices are tuples and edges
represent co-accesses among tuples. The heat graph includes only
6"
1"
4"
Products)
Parts)
Suppliers)
Cold)
Hot)
P1)
P3)
Clump""
To"par..on"P2"
2"
3"
5"
7"
8"
Co-access)
P2)
Warm)
Figure 2: Heat graph example for a Products-Parts-Suppliers database in
partitions P1, P2, and P3. For simplicity, we only consider three degrees
of hotness. P3 is initially overloaded. Clay creates a clump and moves it to
P2. Vertex IDs indicate the order that Clay adds them to the clump.
the tuples whose activity has been observed during the monitoring
interval. Some tuples and edges may be hotter than others. This is
modeled using vertex and edge weights.
Clay builds clumps based on the heat graph. Initially, it creates a
clump consisting of the hottest tuple of the most overloaded partition, namely the Suppliers tuple corresponding to vertex #1 in Figure 2. It
then evaluates the effect of moving the clump to another partition.
To minimize distributed transactions, Clay looks for the partition
whose tuples are most frequently accessed with the tuples in the
clump, which is partition P2 in our example. Moving the clump (consisting of only vertex #1
at this point) would generate too many distributed transactions because
there is a large number of edges between partitions P2 and P3. As
a result, P2 would become overloaded, and P3 would have no ben-
efit from the move due to an increased number of distributed trans-
actions. Therefore, Clay extends the clump with the vertex that is
most frequently co-accessed with the clump, which is vertex #2 in
the example. The process repeats, and the clump is extended with
vertices #3–8. Note that vertices #4 and #6–8 are not co-accessed
with the initial tuple, but are still added to the clump due to the tran-
sitivity of the co-access relation. Note also that vertices #5–8 reside
on a different partition from the initial tuple. Clay ignores the cur-
rent partitioning when building a clump, focusing exclusively on
the co-access patterns and adding affine tuples from any partition.
The process continues until Clay finds a clump that can be moved to
a partition without overloading it. If the clump cannot be extended
anymore or it reaches a maximum size, Clay scales out the system
by adding a new partition and restarts the clump-finding process.
To build the heat graph, it is necessary to collect detailed in-
formation about co-accesses among tuples in the same transaction.
Clay performs this on-line monitoring efficiently and only for a
short interval of time (20 seconds). Although the heat graph can
become large, with up to billions of vertices and edges in our ex-
periments, it is still small enough to fit in main memory; our recon-
figuration algorithm always used less than 4 GB.
3. RELATED WORK
A significant amount of research exists on partitioning strategies
for analytic workloads (OLAP), typically balancing locality with
declustering data to maximize parallelism [20, 33]. Some of that
work explicitly considers on-line partitioning algorithms for analyt-
ics [14] or large graphs [32]. We limit our discussion to partition-

ing of OLTP databases, since the goals and techniques are different
from partitioning for OLAP applications. Most notably, these ap-
proaches do not combine scaling to tuple-level granularity, mixing
load-balancing with minimizing cross-partition transactions, and
building incremental solutions to update the partitioning.
As discussed earlier, Schism [5] is an off-line algorithm that an-
alyzes a transaction log and has no performance monitoring or live
reconfiguration components. It builds an access graph similar to
our heat graph and uses Metis [16] to find a partitioning that min-
imizes the edge cuts. But since Metis cannot support large graphs,
the DBA must pre-process the traces by sampling transactions and
tuples, filtering by access frequency, and aggregating tuples that are
always accessed together in a single vertex. Since keeping an ex-
plicit mapping of every tuple to a partition would result in a huge
routing table, Schism creates a decision tree that simplifies the in-
dividual mapping of tuples into a set of range partitions. Finally, in
the final validation step, Schism compares different solutions ob-
tained in the previous steps and selects the one having the lowest
rate of distributed transactions. Clay’s clump migration heuristic
is incremental, so it minimizes data migration, and it outperforms
Metis when applied to the heat graph. In addition, Clay’s two-tiered
routing creates small sets of hot tuples that minimize the size of the
routing tables, so it does not require Schism’s extra steps.
Sword [25] is another off-line partitioning tool that models the
database as a hypergraph and uses an incremental heuristic to ap-
proximate constrained n-way graph partitioning. It uses a one-
tier routing scheme that divides the database into coarse-grained
chunks. Sword performs incremental partitioning adjustments by
periodically evaluating the effect of swapping pairs of chunks. Our
experiments show that Clay outperforms state-of-the-art algorithms
that compute constrained n-way graph partitioning from scratch.
Furthermore, Clay adopts a two-tiered approach that supports fine-
grained mapping for single tuples.
Like Schism and Sword, JECB [30] provides a partitioning strat-
egy to handle complex schemas, but the focus is on scalable par-
titioning for large clusters. JECB examines a workload, database
schema, and source code to derive a new partitioning plan using a
divide-and-conquer strategy. The work does not explicitly consider
hot and cold partitions (or tuples) that arise from workload skew.
PLP is a technique to address partitioning in a single-server,
shared-memory system to minimize bottlenecks that arise from con-
tention [22]. The approach recursively splits a tree amongst dedi-
cated executors. PLP focuses on workload skew, and does not ex-
plicitly consider co-accesses between tuples or scaling out across
multiple machines. ATraPos improves on PLP by minimizing ac-
cesses to centralized data structures [24]. It considers a certain
number of sub-partitions (similar to algorithms using static blocks)
and assigns them to processor sockets in a way that balances load
and minimizes the inter-process synchronization overhead.
None of the aforementioned papers discuss elasticity (i.e., adding
and removing nodes), but there are several systems that enable
elastic scaling through limiting the scope of transactions. Mega-
store [2] uses entity groups to identify a set of tuples that are se-
mantically related, and limits multi-object transactions to within the
group. Others have presented a technique to identify entity groups
given a schema and workload trace [17]. This approach is similar
to Clay in that it greedily builds sets of related items, but it focuses
on breaking a schema into groups, and load-balancing and tuple-to-
partition mapping are not factors in the grouping. Similarly, Hor-
ticulture [23] identifies the ideal attributes to partition each table
but does not address related tuple placement. Beyond small entity
groups, ElasTras [6], NuoDB [21], and Microsoft’s cloud-based
SQL Server [3] achieve elastic scaling on complex structures by
limiting transactions to a single partition. Although ElasTras does
support elastic scaling, the system does not specify how to split
and merge partitions to balance workload skew and tuple affinity.
Many key-value stores support intelligent data placement for load-
balancing and elastic scaling [31, 11, 9], but provide weaker trans-
action guarantees than a relational DBMS.
Accordion [26] provides coarse-grained elastic partitioning: the
database is pre-partitioned into a relatively small number of data
chunks (or virtual partitions), each potentially comprising a large
number of tuples. The number of chunks must stay small so that
Accordion’s Mixed Integer Linear Programming (MILP) solver
can find an optimal plan. The problem with a coarse-grained ap-
proach is that it cannot deal with skewed workloads where multiple
hot tuples may be concentrated in one data chunk [28]. Accordion
learns the shape of the capacity function for each configuration.
With few chunks there are only relatively few configurations, but if
we consider each tuple as a potential chunk, then it becomes impos-
sible to build an accurate capacity model for every configuration.
Coarse-grained approaches have problems with skewed work-
loads where multiple hot tuples can end up in the same chunk. E-
Store [28] supports tree-schemas by using a two-tiered approach for
load-balancing. It uses fine-grained partitioning for a small number
of hot tuples, and a coarse-grained partitioning for the rest of the
database. Targeting hot tuples in this manner allows the system to
identify hot spots, but it has limitations. Consider the case where
two hot tuples are frequently accessed together in a transaction. E-
Store ignores such co-accesses, so it can independently place hot
tuples on different servers, thereby generating a large number of
distributed transactions. To avoid this problem, E-Store must as-
sume that the database schema is tree-structured and every transac-
tion accesses only the tree of one root tuple. Hence, a root tuple and
its descendants are moved as a unit. Lastly, E-Store fails to address
co-accesses to hot dependent tuples.
Online workload monitoring has been used to deal with hot keys
also in stream processing systems [18, 19] and in general sharding
systems like Slicer [1]. However, these systems have no notion of
transactions or co-accesses.
4. PROBLEM STATEMENT
We now define the data placement problem that Clay seeks to
solve. A database consists of a set of tables ⟨T_1, . . . , T_t⟩. Each table T_i has one or more partitioning attributes ⟨A^i_1, . . . , A^i_h⟩, which are a subset of the total set of attributes of T_i. Tuples are horizontally partitioned across a set of servers s_1, . . . , s_j. All tuples of table T_i with the same values of their partitioning attributes ⟨A^i_1 = x_1, . . . , A^i_h = x_h⟩ are placed on the same server and are modeled as a vertex. The database sample is represented as a set of vertices V = {v_1, . . . , v_n}, where each vertex v has a weight w(v) denoting how frequently the vertex is accessed. Co-accesses between two vertices are modeled as an edge, whose weight denotes the frequency of the co-accesses. We call the resulting graph G(V, E), having vertices in V and edges in E, the heat graph.
Data placement is driven by a partitioning plan P : V → Π that maps each vertex to a partition in Π based on the value of its partitioning attributes. A single partition can correspond to a server, or multiple partitions can be statically mapped onto a single server.
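To make these definitions concrete, a minimal sketch (hypothetical names, not Clay's code): a heat-graph vertex can be keyed by the table name together with the values of its partitioning attributes, and a partitioning plan P is then a mapping from such vertex keys to partitions.

# A heat-graph vertex groups all tuples of a table that share the same
# partitioning-attribute values; the plan P maps vertex keys to partitions.
def vertex_key(table, part_attr_values):
    return (table, tuple(part_attr_values))

plan = {
    vertex_key("Suppliers", [42]): "P3",  # Suppliers tuples with suppId = 42
    vertex_key("Parts", [7]): "P1",       # Parts tuples with partId = 7
}
print(plan[vertex_key("Suppliers", [42])])  # -> P3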
4.1 Incremental Data Placement Problem
Clay solves an incremental data placement problem that can be
formulated as follows. The system starts from an initial plan P .
Let L_P(p) be the load of partition p in the plan P, let ε be the percentage of load imbalance that we allow in the system, and let θ be the average load across all partitions in the plan P multiplied by 1 + ε. Let P be the current partitioning plan, P′ be the next partitioning plan identified by Clay, and ∆(P, P′) be the number of vertices mapped to a different partition in P and P′. Given this, the system seeks to minimize the following objective function:

$\text{minimize } \langle |P'|,\ \Delta(P, P') \rangle$    (1)
$\text{s.t. } \forall p \in \Pi : L_{P'}(p) < \theta$

We specify the two objectives in order of priority. First, we minimize the number of partitions in P′. Second, we minimize the amount of data movement among the solutions with the same number of partitions. In either case, we limit the load imbalance to be at most ε. We define w as the weight of a vertex or edge, E as the set of edges in the heat graph, and k > 0 as a constant that indicates the cost of multi-partition tuple accesses, which require additional coordination. Given this, the load of a partition p ∈ Π in a partitioning plan P is expressed as follows:

$L_P(p) = \sum_{v \in V,\ P(v) = p} w(v) \;+\; \sum_{\langle v,u \rangle \in E,\ P(v) = p,\ P(u) \neq p} w(\langle v,u \rangle) \cdot k$    (2)
The parameter k indicates how much to prioritize solutions that
minimize distributed transactions over ones that balance tuple ac-
cesses. Increasing k gives greater weight to the number of dis-
tributed transactions in the determination of the load of a partition.
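The load function of Equation 2 is straightforward to compute from the heat graph and a plan. Below is a minimal Python sketch (an illustration over assumed dictionary-based data structures, not the system's code); k defaults to 50 here only because Section 7.1 reports that value for the experiments.

# Sketch of Equation 2: the load of partition p under plan P.
# vertex_weight: {v: w(v)}; edge_weight: {(v, u): w(v, u)}; plan: {v: partition}.
def partition_load(p, plan, vertex_weight, edge_weight, k=50):
    local = sum(w for v, w in vertex_weight.items() if plan[v] == p)
    cross = sum(w for (v, u), w in edge_weight.items()
                if (plan[v] == p) != (plan[u] == p))  # edges incident on p that cross it
    return local + k * cross

A partition is then flagged as overloaded when partition_load exceeds the threshold θ derived from the average load and ε as defined above.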
4.2 Comparison with Graph Partitioning
We now revisit the issue of comparing Clay with graph partition-
ing techniques, and in particular to the common variant solved by
Metis [16] in Schism [5]. The incremental data placement prob-
lem is different from constrained n-way graph partitioning on the
heat graph, where n is the number of database partitions. The first
distinction is incrementality, since a graph partitioner ignores the
previous plan P and produces a new plan P′ from scratch. By computing a new P′, the DBMS may have to shuffle data to transition from P to P′, which will degrade its performance. We contend,
however, that the difference is not limited to incrementality.
Graph partitioning produces a plan P
that minimizes the num-
ber of edges across partitions under the constraint of a maximum
load imbalance among partitions. The load of a partition is ex-
pressed as the sum of the weights of the vertices in the partition:
$\hat{L}_P(p) = \sum_{v \in V,\ P(v) = p} w(v)$    (3)

To be more precise, consider the Metis graph partitioner that solves the following problem:

$\text{minimize } |\{\langle v,u \rangle \in E : P'(v) \neq P'(u)\}|$    (4)
$\text{s.t. } \forall p, q \in \Pi : \hat{L}_{P'}(p) / \hat{L}_{P'}(q) < 1 + \eta$

where η is an imbalance constraint provided by the user.
The load balancing constraint is over a load function, L̂_P′, which
does not take into account the cost of distributed transactions. In
graph terms, the function does not take into account the load caused
by cross-partition edges. This is in contrast with the definitions of
Equations 1 and 2, where the load threshold considers at the same
time both local and remote tuple accesses and their respective cost.
The formulation of Equations 1 and 2 has two advantages over
Equations 3 and 4. The first is that the former constrains the num-
ber of cross-partition edges per partition, whereas Equation 4 min-
imizes the total number of cross-partition edges. Therefore, Equa-
tion 4 could create a “star” partitioning setting where all cross-
partition edges, and thus all distributed transactions, are incident
on a single server, causing that server to be highly overloaded.
!"#$%&'()*"+(,-.+/*$+01*%&-
2345-!6/*%7-8%9,9:-;'!*)$%<
4$#(/#=*+)(-
>)(+*)$-#(&-
;+,"'3%?%@-
!6/*%7->)(+*)$
A@17B-
>+,$#*+)(
Current'partition'plan
Transactional'access'trace
C%=)(D+,1$#*+)(-E(,+(%-
8%9,9:-!F1#@@<
Reconfiguration'plan
Intercept'SQL'query'
processing
Figure 3: The Clay framework.
The second advantage of our formulation is that it combines the
two conflicting goals of reaching balanced tuple accesses and min-
imizing distributed transactions using a single load function. Con-
sidering the two goals as separate makes it difficult to find a good
level of η, as our experiments show. In fact, if the threshold is
too low, Metis creates a balanced load in terms of single partition
transactions, but it also causes many distributed transactions. If the
threshold is too large, Metis causes fewer distributed transactions,
but then the load is not necessarily balanced.
One could consider using L_P instead of L̂_P as the expression of
load in Equation 4. Unfortunately, this is not possible since vertex
weights need to be provided as an input to the graph partitioner,
whereas the number of cross-partition edges depends on the out-
come of the partitioning itself.
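For contrast, the quantities used in Equations 3 and 4 can be sketched with the same assumed data structures as above (again an illustration, not Metis itself); note that the balance constraint here ignores cross-partition edges, which is the crux of the difference discussed in this section.

# Sketch of Equations 3-4: objective and constraint of a Metis-style partitioner.
def balanced_load(p, plan, vertex_weight):
    # Equation 3: vertex weights only; cross-partition edges are not counted
    return sum(w for v, w in vertex_weight.items() if plan[v] == p)

def total_edge_cut(plan, edges):
    # Equation 4 objective: edges whose endpoints land on different partitions
    return sum(1 for (v, u) in edges if plan[v] != plan[u])

def balance_ok(plan, vertex_weight, partitions, eta):
    # Equation 4 constraint: pairwise load ratios stay below 1 + eta
    loads = [balanced_load(p, plan, vertex_weight) for p in partitions]
    return min(loads) > 0 and max(loads) / min(loads) < 1 + eta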
5. SYSTEM ARCHITECTURE
Clay runs on top of a distributed OLTP DBMS and a reconfig-
uration engine that can dynamically change the data layout of the
database (see Figure 3). The monitoring component of Clay is acti-
vated whenever performance objectives are not met (e.g., when the
latency of the system does not meet an SLA). If these conditions
occur, Clay starts a transaction monitor that collects detailed work-
load monitoring information (Section 6). This information is sent
to a centralized reconfiguration controller that builds a heat graph.
The controller then runs our migration algorithm that builds clumps
on the fly and determines how to migrate them (Section 7).
Although Clay’s mechanisms are generic, the implementation
that we use in our evaluation is based on the H-Store system [12,
15]. H-Store is a distributed, in-memory DBMS that is optimized
for OLTP workloads and assumes that most transactions are short-
lived and datasets are easily partitioned. The original H-Store de-
sign supports a static configuration where the set of partitions and
hosts and the mapping between tuples and partitions are all fixed.
The E-Store [28] system relaxes some of these restrictions by al-
lowing for a dynamic number of partitions and nodes. E-Store also
changes how tuples are mapped to partitions by using a two-tiered
partitioning scheme that uses fine-grained partitioning (e.g., range
partitioning) for a set of “hot” tuples and then a simple scheme
(e.g., range partitioning of large chunks or hash partitioning) for
large blocks of “cold” tuples. Clay uses this same two-tier partition-
ing scheme in H-Store. It also uses Squall for reconfiguration [8],
although Clay's techniques are agnostic to the specific reconfiguration engine used.
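A two-tiered lookup of this kind can be sketched as follows (a hypothetical illustration, not H-Store's or E-Store's actual routing code): an explicit map holds the small set of hot tuples, and a default scheme (hash partitioning here) covers the cold remainder.

# Sketch of two-tiered tuple-to-partition routing: hot-tuple exceptions first,
# then a coarse-grained default (hash partitioning here) for cold tuples.
def route(table, part_key, hot_map, num_partitions):
    exception = hot_map.get((table, part_key))
    if exception is not None:
        return exception                                  # fine-grained tier: explicit placement
    return hash((table, part_key)) % num_partitions       # cold tier: default scheme

hot_map = {("Suppliers", 42): 2}                          # hot supplier pinned to partition 2
print(route("Suppliers", 42, hot_map, 3))                 # -> 2
print(route("Parts", 7, hot_map, 3))                      # -> hash-based default partition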
6. TRANSACTION MONITORING
The data placement problem of Section 4 models a database as a
weighted graph. The monitoring component collects the necessary
information to build the graph: it counts the number of accesses to
tuples (vertices) and the co-accesses (edges) among tuples.

Monitoring tracks tuple accesses by hooking onto the transaction
routing module. When processing a transaction, H-Store breaks
SQL statements into smaller fragments that execute low-level op-
erations. It then routes these fragments to one or more partitions
based on the values of the partitioning attributes of the tuples that
are accessed by the fragment.
Clay performs monitoring by adding hooks in the DBMS’s query
processing components that extract the values of the partitioning
attributes used to route the fragments. These values correspond to
specific vertices of the graph, as discussed in Section 4. The mon-
itoring component is executed by each server and writes tuple ac-
cesses onto a monitoring file using the format ⟨tid, T, x_1, . . . , x_h⟩, where tid is a unique id associated with the transaction performing the access, T is the table containing the accessed tuple, h is the number of partitioning attributes of table T, and x_i is the value of the i-th partitioning attribute in the accessed tuple. When a transaction is completed, monitoring adds an entry ⟨END, tid⟩.
Query-level monitoring captures more detailed information than
related approaches. It is able to determine not only which tuples are
accessed, but also which tuples are accessed together by the same
transaction. E-Store restricted monitoring to root tuples because of
the high cost of using low-level operations to track access patterns
for single tuples [28]. Our evaluation shows that Clay’s monitoring
is more accurate and has low overhead.
Clay disables monitoring during normal transaction execution
and only turns it on when some application-specific objectives are
violated (e.g., if the 99th percentile latency exceeds a pre-defined
target). After being turned on, monitoring remains active for a short
time. Our experiments established that 20 seconds was sufficient to
detect frequently-accessed hot tuples.
Once a server terminates monitoring, it sends the collected data
to a centralized controller that builds the heat graph (see Figure 2)
and computes the new plan. For every access to a tuple/vertex v
found in the monitoring data, the controller increments v's weight
by one divided by the length of the monitoring interval for the
server, to reflect the rate of transaction accesses. Vertices accessed
by the same transactions are connected by an edge whose weight is
computed similarly to a vertex weight.
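Putting the record format and the weighting rule together, the controller's graph construction can be sketched as follows (illustrative Python over an assumed in-memory list of records rather than the actual monitoring files).

# Sketch: build the heat graph from monitoring records.
# records: iterable of ("ACCESS", tid, vertex) and ("END", tid) tuples;
# vertex is (table, partitioning-attribute values), as in Section 4.
from collections import defaultdict
from itertools import combinations

def build_heat_graph(records, interval_seconds):
    per_txn = defaultdict(set)
    vertex_w = defaultdict(float)
    edge_w = defaultdict(float)
    rate = 1.0 / interval_seconds                 # each access counts 1/interval
    for rec in records:
        if rec[0] == "ACCESS":
            _, tid, vertex = rec
            vertex_w[vertex] += rate
            per_txn[tid].add(vertex)
        else:                                     # ("END", tid): transaction finished
            txn_vertices = per_txn.pop(rec[1], set())
            for u, v in combinations(sorted(txn_vertices), 2):
                edge_w[(u, v)] += rate            # co-access edge weight
    return vertex_w, edge_w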
7. CLUMP MIGRATION
The clump migration algorithm is the central component of Clay.
It takes as input the current partitioning plan P , which maps tu-
ples to partitions, and the heat graph G produced by the monitoring
component. Its output is a new partitioning plan (see Section 4).
We now describe the algorithm in more detail.
7.1 Dealing with Overloaded Partitions
The clump migration algorithm of Clay starts by identifying the
set of “overloaded” partitions that have a load higher than a thresh-
old θ. The load per partition is defined by the formula in Equation 2
(we used a value of k = 50 in all our experiments since we found
that distributed transactions impact performance much more than
local tuple accesses). For each overloaded partition P_o, the migration algorithm dynamically defines and migrates clumps until the load of P_o is below the θ threshold (see Algorithm 1). A clump created to offload a partition P_o will contain some tuples of P_o but it can also contain tuples from other partitions. A move is a pair consisting of a clump and a destination partition.
Initializing a clump. The algorithm starts with an empty clump M. It then sets M to be the hottest vertex h in the hot tuples list H(P_o), which contains the most frequently accessed vertices for each partition (i.e., those having the highest weight, in descending order of access frequency).
Algorithm 1: Migration algorithm to offload partition P_o
  look-ahead ← A;
  while L(P_o) > θ do
    if M = ∅ then
      // initialize the clump
      h ← next hot tuple in H(P_o);
      M ← {h};
      d ← initial-partition(M);
    else if some vertex in M has neighbors then
      // expand the clump
      M ← M ∪ most-co-accessed-neighbor(M, G);
      d ← update-dest(M, d);
    else
      // cannot expand the clump anymore
      if C ≠ ∅ then
        move C.M to C.d;
        M ← ∅;
        look-ahead ← A;
      else
        add a new server and restart the algorithm;
    // examine the new clump
    if feasible(M, d) then
      C.M ← M;
      C.d ← d;
    else if C ≠ ∅ then
      look-ahead ← look-ahead − 1;
      if look-ahead = 0 then
        move C.M to C.d;
        M ← ∅;
        look-ahead ← A;
Algorithm 2: Finding the best destination for M
  function update-dest(M, d)
    if ¬feasible(M, d) then
      a ← partition most frequently accessed with M;
      if a ≠ d ∧ feasible(M, a) then
        return a;
      l ← least loaded partition;
      if l ≠ d ∧ ∆_r(M, a) < ∆_r(M, l) ∧ feasible(M, l) then
        return l;
    return d;
The algorithm then picks the destination partition that minimizes the overall load of the system. The function initial-partition selects the destination partition d having the lowest receiver delta ∆_r(M, d), where ∆_r(M, d) is defined as the load of d after receiving M minus the load of d before receiving M. Given the way
the load function is defined (see Equation 2), the partition with the
lowest receiver delta is the one whose tuples are most frequently
co-accessed with the tuples in M , so moving M to that partition
minimizes the number of distributed transactions. The initial selec-
tion of d prioritizes partitions that do not become overloaded after
the move, if available. Among partitions with the same receiver
delta, the heuristic selects the one with the lowest overall load. In
systems like H-Store that run multiple partitions on the same physi-
cal server, the cost function assigns a lower cost to transactions that
access partitions in the same server than to distributed transactions.
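Using the load function of Equation 2 (the partition_load sketch in Section 4.1), the receiver delta and the initial destination choice can be sketched as follows; this is a simplified illustration with one partition per server and hypothetical helper names, not the authors' implementation.

# Sketch: receiver delta and initial destination selection for a clump M.
def receiver_delta(clump, dest, plan, vertex_weight, edge_weight, k=50):
    """Load of dest after receiving the clump minus its load before."""
    before = partition_load(dest, plan, vertex_weight, edge_weight, k)
    moved = {**plan, **{v: dest for v in clump}}   # tentative plan with M moved to dest
    after = partition_load(dest, moved, vertex_weight, edge_weight, k)
    return after - before

def initial_partition(clump, partitions, plan, vertex_weight, edge_weight, theta, k=50):
    """Prefer destinations that stay below theta, then lowest receiver delta, then lowest load."""
    def key(d):
        delta = receiver_delta(clump, d, plan, vertex_weight, edge_weight, k)
        before = partition_load(d, plan, vertex_weight, edge_weight, k)
        overloaded = before + delta >= theta
        return (overloaded, delta, before)
    return min(partitions, key=key)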
Expanding a clump. If M is not empty, it is extended with the
neighboring tuple of M that is most frequently co-accessed with
a tuple in M. This is found by iterating over all the neighbors of
