Scalable SQL and NoSQL Data Stores
Rick Cattell
Originally published in 2010, last
revised December 2011
ABSTRACT
In this paper, we examine a number of SQL and so-called “NoSQL” data stores designed to scale simple OLTP-style application loads over many servers.
Originally motivated by Web 2.0 applications, these
systems are designed to scale to thousands or millions
of users doing updates as well as reads, in contrast to
traditional DBMSs and data warehouses. We contrast
the new systems on their data model, consistency
mechanisms, storage mechanisms, durability
guarantees, availability, query support, and other
dimensions. These systems typically sacrifice some of
these dimensions, e.g. database-wide transaction
consistency, in order to achieve others, e.g. higher
availability and scalability.
Note: Bibliographic references for systems are not
listed, but URLs for more information can be found in
the System References table at the end of this paper.
Caveat: Statements in this paper are based on sources
and documentation that may not be reliable, and the
systems described are “moving targets,” so some
statements may be incorrect. Verify through other
sources before depending on information here.
Nevertheless, we hope this comprehensive survey is
useful! Check for future corrections on the author’s
web site cattell.net/datastores.
Disclosure: The author is on the technical advisory
board of Schooner Technologies and has a consulting
business advising on scalable databases.
1. OVERVIEW
In recent years a number of new systems have been
designed to provide good horizontal scalability for
simple read/write database operations distributed over
many servers. In contrast, traditional database
products have comparatively little or no ability to scale
horizontally on these applications. This paper
examines and compares the various new systems.
Many of the new systems are referred to as “NoSQL” data stores. The definition of NoSQL, which stands for “Not Only SQL” or “Not Relational”, is not entirely agreed upon. For the purposes of this paper, NoSQL systems generally have six key features:
1. the ability to horizontally scale “simple
operation” throughput over many servers,
2. the ability to replicate and to distribute (partition)
data over many servers,
3. a simple call level interface or protocol (in
contrast to a SQL binding),
4. a weaker concurrency model than the ACID
transactions of most relational (SQL) database
systems,
5. efficient use of distributed indexes and RAM for
data storage, and
6. the ability to dynamically add new attributes to
data records.
The systems differ in other ways, and in this paper we
contrast those differences. They range in functionality
from the simplest distributed hashing, as supported by
the popular memcached open source cache, to highly
scalable partitioned tables, as supported by Google's BigTable [1]. In fact, BigTable, memcached, and
Amazon’s Dynamo [2] provided a proof of concept
that inspired many of the data stores we describe here:
• Memcached demonstrated that in-memory indexes can be highly scalable, distributing and replicating objects over multiple nodes.
• Dynamo pioneered the idea of eventual consistency as a way to achieve higher availability and scalability: data fetched are not guaranteed to be up-to-date, but updates are guaranteed to be propagated to all nodes eventually.
• BigTable demonstrated that persistent record storage could be scaled to thousands of nodes, a feat that most of the other systems aspire to.
A key feature of NoSQL systems is “shared nothing” horizontal scaling: replicating and partitioning data over many servers. This allows them to support a large number of simple read/write operations per second. This simple operation load is traditionally called OLTP (online transaction processing), but it is also common in modern web applications.
The NoSQL systems described here generally do not
provide ACID transactional properties: updates are
eventually propagated, but there are limited guarantees
on the consistency of reads. Some authors suggest a
“BASE” acronym in contrast to the “ACID” acronym:
BASE = Basically Available, Soft state,
Eventually consistent
ACID = Atomicity, Consistency, Isolation, and
Durability
The idea is that by giving up ACID constraints, one
can achieve much higher performance and scalability.

However, the systems differ in how much they give up.
For example, most of the systems call themselves
“eventually consistent”, meaning that updates are
eventually propagated to all nodes, but many of them
provide mechanisms for some degree of consistency,
such as multi-version concurrency control (MVCC).
Proponents of NoSQL often cite Eric Brewer’s CAP
theorem [4], which states that a system can have only
two out of three of the following properties:
consistency, availability, and partition-tolerance. The
NoSQL systems generally give up consistency.
However, the trade-offs are complex, as we will see.
New relational DBMSs have also been introduced to
provide better horizontal scaling for OLTP, when
compared to traditional RDBMSs. After examining
the NoSQL systems, we will look at these SQL
systems and compare the strengths of the approaches.
The SQL systems strive to provide horizontal
scalability without abandoning SQL and ACID
transactions. We will discuss the trade-offs here.
In this paper, we will refer to both the new SQL and
NoSQL systems as data stores, since the term
“database system” is widely used to refer to traditional
DBMSs. However, we will still use the term
“database” to refer to the stored data in these systems.
All of the data stores have some administrative unit
that you would call a database: data may be stored in
one file, or in a directory, or via some other
mechanism that defines the scope of data used by a
group of applications. Each database is an island unto
itself, even if the database is partitioned and distributed
over multiple machines: there is no “federated
database” concept in these systems (as with some
relational and object-oriented databases), allowing
multiple separately-administered databases to appear
as one. Most of the systems allow horizontal
partitioning of data, storing records on different servers
according to some key; this is called “sharding”. Some
of the systems also allow vertical partitioning, where
parts of a single record are stored on different servers.
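To make the sharding idea concrete, here is a minimal sketch in Python (all names here are hypothetical, not any particular system's API): each record key is hashed, and the hash selects the server that stores the record.

import hashlib

SERVERS = ["server0", "server1", "server2", "server3"]

def shard_for(key):
    # Hash the record key and map it to one of the servers.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SERVERS[h % len(SERVERS)]

print(shard_for("customer:12345"))  # this record always lands on the same server

Note that simple modulo hashing forces most keys to move when a server is added or removed; the consistent hashing used by Voldemort and Riak (Section 2) avoids most of that movement.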
1.1 Scope of this Paper
Before proceeding, some clarification is needed in defining “horizontal scalability” and “simple operations”. These define the focus of this paper.
By “simple operations”, we refer to key lookups, reads
and writes of one record or a small number of records.
This is in contrast to complex queries or joins, read-
mostly access, or other application loads. With the
advent of the web, especially Web 2.0 sites where
millions of users may both read and write data,
scalability for simple database operations has become
more important. For example, applications may search
and update multi-server databases of electronic mail,
personal profiles, web postings, wikis, customer
records, online dating records, classified ads, and many
other kinds of data. These all generally fit the
definition of “simple operation” applications: reading
or writing a small number of related records in each
operation.
The term “horizontal scalability” means the ability to
distribute both the data and the load of these simple
operations over many servers, with no RAM or disk
shared among the servers. Horizontal scaling differs
from “vertical” scaling, where a database system
utilizes many cores and/or CPUs that share RAM and
disks. Some of the systems we describe provide both
vertical and horizontal scalability, and the effective use
of multiple cores is important, but our main focus is on
horizontal scalability, because the number of cores that
can share memory is limited, and horizontal scaling
generally proves less expensive, using commodity
servers. Note that horizontal and vertical partitioning
are not related to horizontal and vertical scaling,
except that they are both useful for horizontal scaling.
1.2 Systems Beyond our Scope
Some authors have used a broad definition of NoSQL,
including any database system that is not relational.
Specifically, they include:
• Graph database systems: Neo4j and OrientDB provide efficient distributed storage and queries of a graph of nodes with references among them.
• Object-oriented database systems: Object-oriented DBMSs (e.g., Versant) also provide efficient distributed storage of a graph of objects, and materialize these objects as programming language objects.
• Distributed object-oriented stores: Very similar to object-oriented DBMSs, systems such as GemFire distribute object graphs in-memory on multiple servers.
These systems are a good choice for applications that
must do fast and extensive reference-following,
especially where data fits in memory. Programming
language integration is also valuable. Unlike the
NoSQL systems, these systems generally provide
ACID transactions. Many of them provide horizontal
scaling for reference-following and distributed query
decomposition, as well. Due to space limitations,
however, we have omitted these systems from our
comparisons. The applications and the necessary
optimizations for scaling for these systems differ from
the systems we cover here, where key lookups and
simple operations predominate over reference-
following and complex object behavior. It is possible
these systems can scale on simple operations as well,
but that is a topic for a future paper, and proof through
benchmarks.

Data warehousing database systems provide horizontal
scaling, but are also beyond the scope of this paper.
Data warehousing applications are different in
important ways:
• They perform complex queries that collect and join information from many different tables.
• The ratio of reads to writes is high: that is, the database is read-only or read-mostly.
There are existing systems for data warehousing that
scale well horizontally. Because the data is
infrequently updated, it is possible to organize or
replicate the database in ways that make scaling
possible.
1.3 Data Model Terminology
Unlike relational (SQL) DBMSs, the terminology used
by NoSQL data stores is often inconsistent. For the
purposes of this paper, we need a consistent way to
compare the data models and functionality.
All of the systems described here provide a way to
store scalar values, like numbers and strings, as well as
BLOBs. Some of them also provide a way to store
more complex nested or reference values. The systems
all store sets of attribute-value pairs, but use different
data structures, specifically:
• A “tuple” is a row in a relational table, where attribute names are pre-defined in a schema, and the values must be scalar. The values are referenced by attribute name, as opposed to an array or list, where they are referenced by ordinal position.
• A “document” allows values to be nested documents or lists as well as scalar values, and the attribute names are dynamically defined for each document at runtime. A document differs from a tuple in that the attributes are not defined in a global schema, and a wider range of values is permitted.
• An “extensible record” is a hybrid between a tuple and a document, where families of attributes are defined in a schema, but new attributes can be added (within an attribute family) on a per-record basis. Attributes may be list-valued.
• An “object” is analogous to an object in programming languages, but without the procedural methods. Values may be references or nested objects.
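To make these four structures concrete, here is an illustrative sketch using Python literals (hypothetical data, not any particular system's API):

# A tuple: attribute names fixed by a schema; values must be scalar.
tuple_row = {"id": 12345, "name": "Mr. Smith", "phone": "415-555-6524"}

# A document: attribute names defined per document; values may nest.
document = {
    "name": "Mr. Smith",
    "orders": [{"sku": "A1", "qty": 2}],  # nested list of documents
    "nickname": "Smitty",                 # attribute other documents may lack
}

# An extensible record: attribute families ("identity", "contact") are in
# the schema, but new attributes can be added within a family per record.
extensible_record = {
    "identity": {"name": "Mr. Smith"},
    "contact": {"phone": "415-555-6524",
                "twitter": "@smith"},     # added at runtime for this record
}

# An object: like a document, but values may be references to other objects.
class Customer:
    def __init__(self, name, sales_rep):
        self.name = name
        self.sales_rep = sales_rep        # reference to another object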
1.4 Data Store Categories
In this paper, the data stores are grouped according to
their data model:
• Key-value Stores: These systems store values and an index to find them, based on a programmer-defined key.
• Document Stores: These systems store documents, as just defined. The documents are indexed and a simple query mechanism is provided.
• Extensible Record Stores: These systems store extensible records that can be partitioned vertically and horizontally across nodes. Some papers call these “wide column stores”.
• Relational Databases: These systems store (and index and query) tuples. The new RDBMSs that provide horizontal scaling are covered in this paper.
Data stores in these four categories are covered in the
next four sections, respectively. We will then
summarize and compare the systems.
2. KEY-VALUE STORES
The simplest data stores use a data model similar to the
popular memcached distributed in-memory cache, with
a single key-value index for all the data. We’ll call
these systems key-value stores. Unlike memcached,
these systems generally provide a persistence
mechanism and additional functionality as well:
replication, versioning, locking, transactions, sorting,
and/or other features. The client interface provides
inserts, deletes, and index lookups. Like memcached,
none of these systems offer secondary indices or keys.
2.1 Project Voldemort
Project Voldemort is an advanced key-value store,
written in Java. It is open source, with substantial
contributions from LinkedIn. Voldemort provides
multi-version concurrency control (MVCC) for
updates. It updates replicas asynchronously, so it does
not guarantee consistent data. However, it can
guarantee an up-to-date view if you read a majority of
replicas.
Voldemort supports optimistic locking for consistent
multi-record updates: if updates conflict with any other
process, they can be backed out. Vector clocks, as
used in Dynamo [3], provide an ordering on versions.
You can also specify which version you want to
update, for the put and delete operations.
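A sketch of this versioned-update idea (hypothetical class and method names, not Voldemort's actual client API): a read returns a value together with its version; a write supplies the version it read, and is rejected if the stored version has moved on in the meantime.

class ObsoleteVersionError(Exception):
    pass

class VersionedStore:
    def __init__(self):
        self._data = {}          # key -> (version, value)

    def get(self, key):
        return self._data.get(key, (0, None))

    def put(self, key, value, expected_version):
        version, _ = self._data.get(key, (0, None))
        if version != expected_version:
            raise ObsoleteVersionError("another process updated %r" % key)
        self._data[key] = (version + 1, value)

store = VersionedStore()
v, _ = store.get("cart:12345")
store.put("cart:12345", ["book"], v)         # succeeds, version advances
store.put("cart:12345", ["book", "cd"], v)   # raises: stale version, back out and retry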
Voldemort supports automatic sharding of data.
Consistent hashing is used to distribute data around a
ring of nodes: data hashed to node K is replicated on nodes K+1 … K+n, where n is the desired number of extra copies (often n=1). Using a good sharding technique, there should be many more “virtual” nodes
than physical nodes (servers). Once data partitioning
is set up, its operation is transparent. Nodes can be
added or removed from a database cluster, and the
system adapts automatically. Voldemort automatically
detects and recovers failed nodes.
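A minimal consistent-hashing sketch along these lines (illustrative Python, not Voldemort's actual code): virtual nodes and keys are hashed onto one ring; a key is stored on the first node clockwise from its hash, with replicas on the next distinct physical nodes.

import bisect, hashlib

def _hash(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=8):
        # Place several "virtual" nodes per physical server on the ring.
        self.points = sorted((_hash("%s#%d" % (n, i)), n)
                             for n in nodes for i in range(vnodes))
        self.hashes = [h for h, _ in self.points]

    def preference_list(self, key, n_extra=1):
        # First node clockwise from the key's hash, plus n_extra successors.
        i = bisect.bisect(self.hashes, _hash(key))
        result = []
        while len(result) < 1 + n_extra:
            node = self.points[i % len(self.points)][1]
            if node not in result:
                result.append(node)
            i += 1
        return result

ring = Ring(["nodeA", "nodeB", "nodeC", "nodeD"])
print(ring.preference_list("customer:12345"))  # primary plus one replica

Because each physical server owns many small arcs of the ring, adding or removing a server moves only the keys on its own arcs, and load spreads evenly across the remaining servers.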

Voldemort can store data in RAM, but it also permits
plugging in a storage engine. In particular, it supports
a Berkeley DB and Random Access File storage
engine. Voldemort supports lists and records in
addition to simple scalar values.
2.2 Riak
Riak is written in Erlang. It was open-sourced by
Basho in mid-2009. Basho alternately describes Riak
as a “key-value store” and “document store”. We will
categorize it as an advanced key-value store here,
because it lacks important features of document stores, but it (and Voldemort) has more functionality than the other key-value stores:
• Riak objects can be fetched and stored in JSON format, and thus can have multiple fields (like documents), and objects can be grouped into buckets, like the collections supported by document stores, with allowed/required fields defined on a per-bucket basis.
• Riak does not support indices on any fields except the primary key. The only thing you can do with the non-primary fields is fetch and store them as part of a JSON object. Riak lacks the query mechanisms of the document stores; the only lookup you can do is on primary key.
Riak supports replication of objects and sharding by
hashing on the primary key. It allows replica values to
be temporarily inconsistent. Consistency is tunable by
specifying how many replicas (on different nodes)
must respond for a successful read and how many must
respond for a successful write. This is per-read and
per-write, so different parts of an application can
choose different trade-offs.
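The standard quorum arithmetic behind this tuning: with N replicas, a write acknowledged by W nodes and a read that consults R nodes are guaranteed to overlap in at least one up-to-date replica whenever R + W > N. A small worked example (a sketch of the rule, not Riak's API):

def read_sees_latest_write(n, w, r):
    # At least one of the R nodes read must be among the W nodes written.
    return r + w > n

print(read_sees_latest_write(3, 2, 2))  # True: reads see the latest write
print(read_sees_latest_write(3, 1, 1))  # False: fast, but reads may be stale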
Like Voldemort, Riak uses a derivative of MVCC
where vector clocks are assigned when values are
updated. Vector clocks can be used to determine when objects are direct descendants of each other or of a common parent, so Riak can often self-repair data that it discovers to be out of sync.
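A minimal vector-clock sketch (illustrative, not Riak's implementation): each clock maps a node id to an update counter; version a descends from version b if a's counter is at least b's for every node, and two versions are concurrent (a conflict to reconcile) if neither descends from the other.

def descends(a, b):
    # True if version a is a descendant of (or equal to) version b.
    return all(a.get(node, 0) >= count for node, count in b.items())

def concurrent(a, b):
    # Neither descends from the other: independent updates to be reconciled.
    return not descends(a, b) and not descends(b, a)

v1 = {"node1": 2, "node2": 1}
v2 = {"node1": 1, "node2": 1}
v3 = {"node1": 1, "node2": 2}
print(descends(v1, v2))    # True: v1 saw everything v2 saw
print(concurrent(v1, v3))  # True: conflicting updates, must be repaired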
The Riak architecture is symmetric and simple. Like
Voldemort, it uses consistent hashing. There is no
distinguished node to track status of the system: the
nodes use a gossip protocol to track who is alive and
who has which data, and any node may service a client
request. Riak also includes a map/reduce mechanism
to split work over all the nodes in a cluster.
The client interface to Riak is based on RESTful HTTP
requests. REST (REpresentational State Transfer) uses
uniform, stateless, cacheable, client-server calls. There
is also a programmatic interface for Erlang, Java, and
other languages.
The storage part of Riak is “pluggable”: the key-value
pairs may be in memory, in ETS tables, in DETS
tables, or in Osmos tables. ETS, DETS, and Osmos
tables are all implemented in Erlang, with different
performance and properties.
One unique feature of Riak is that it can store “links” between objects (documents), for example to link
objects for authors to the objects for the books they
wrote. Links reduce the need for secondary indices,
but there is still no way to do range queries.
Here’s an example of a Riak object described in JSON:
{
  "bucket":"customers",
  "key":"12345",
  "object":{
    "name":"Mr. Smith",
    "phone":"415-555-6524" },
  "links":[
    ["sales","Mr. Salesguy","salesrep"],
    ["cust-orders","12345","orders"] ],
  "vclock":"opaque-riak-vclock",
  "lastmod":"Mon, 03 Aug 2009 18:49:42 GMT"
}
Note that the primary key is distinguished, while other fields are part of an “object” portion. Also note that the bucket, vector clock, and modification date are specified as part of the object, and links to other objects are supported.
2.3 Redis
The Redis key-value data store started as a one-person
project but now has multiple contributors as BSD-
licensed open source. It is written in C.
A Redis server is accessed by a wire protocol
implemented in various client libraries (which must be
updated when the protocol changes). The client side
does the distributed hashing over servers. The servers
store data in RAM, but data can be copied to disk for
backup or system shutdown. System shutdown may be
needed to add more nodes.
Like the other key-value stores, Redis implements
insert, delete and lookup operations. Like Voldemort,
it allows lists and sets to be associated with a key, not
just a blob or string. It also includes list and set
operations.
Redis does atomic updates by locking, and does
asynchronous replication. It is reported to support
about 100K gets/sets per second on an 8-core server.
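A short sketch of these operations using the redis-py client library, assuming a Redis server on localhost at the default port (the key names are made up):

import redis  # the redis-py client library

r = redis.Redis(host="localhost", port=6379)

# Plain key-value operations.
r.set("user:12345:name", "Mr. Smith")
print(r.get("user:12345:name"))           # b'Mr. Smith'

# Lists and sets associated with a key, with their own operations.
r.rpush("user:12345:orders", "order1", "order2")
print(r.lrange("user:12345:orders", 0, -1))
r.sadd("user:12345:tags", "premium", "west-coast")
print(r.smembers("user:12345:tags"))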
2.4 Scalaris
Scalaris is functionally similar to Redis. It was written
in Erlang at the Zuse Institute in Berlin, and is open
source. In distributing data over nodes, it allows key
ranges to be assigned to nodes, rather than simply
hashing to nodes. This means that a query on a range
of values does not need to go to every node, and it also
may allow better load balancing, depending on key
distribution.
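A sketch of the range-partitioning idea (hypothetical layout, not Scalaris's actual code): contiguous key ranges are assigned to nodes, so a range query contacts only the nodes whose ranges overlap it.

import bisect

# Sorted lower bounds: node i owns keys from BOUNDARIES[i] up to BOUNDARIES[i+1].
BOUNDARIES = ["a", "g", "n", "t"]
NODES      = ["node1", "node2", "node3", "node4"]

def node_for(key):
    return NODES[max(bisect.bisect_right(BOUNDARIES, key) - 1, 0)]

def nodes_for_range(lo, hi):
    i = max(bisect.bisect_right(BOUNDARIES, lo) - 1, 0)
    j = max(bisect.bisect_right(BOUNDARIES, hi) - 1, 0)
    return NODES[i:j + 1]

print(node_for("smith"))          # 'node3', which owns keys n <= key < t
print(nodes_for_range("k", "p"))  # ['node2', 'node3']: only two nodes contacted

With hashing, the same range query would have to be broadcast to every node.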

Like the other key-value stores, it supports insert,
delete, and lookup. It does replication synchronously
(copies must be updated before the operation is
complete) so data is guaranteed to be consistent.
Scalaris also supports transactions with ACID
properties on multiple objects. Data is stored in
memory, but replication and recovery from node
failures provides durability of the updates.
Nevertheless, a multi-node power failure would cause
disastrous loss of data, and the virtual memory limit
sets a maximum database size.
Scalaris reads and writes must go to a majority of the
replicas before an operation completes. Scalaris uses a
ring of nodes, an unusual distribution and replication
strategy that requires log(N) hops to read/write a key-
value pair.
2.5 Tokyo Cabinet
Tokyo Cabinet / Tokyo Tyrant was a sourceforge.net project, but is now licensed and maintained by FAL Labs. Tokyo Cabinet is the back-end server; Tokyo Tyrant is a client library for remote access. Both are written in C.
There are six different variations for the Tokyo
Cabinet server: hash indexes in memory or on disk, B-
trees in memory or on disk, fixed-size record tables,
and variable-length record tables. The engines
obviously differ in their performance characteristics,
e.g. the fixed-length records allow quick lookups.
There are slight variations on the API supported by
these engines, but they all support common
get/set/update operations. The documentation is a bit
unclear, but they claim to support locking, ACID
transactions, a binary array data type, and more
complex update operations to atomically update a
number or concatenate to a string. They support
asynchronous replication with dual master or
master/slave. Recovery of a failed node is manual, and
there is no automatic sharding.
2.6 Memcached, Membrain, and
Membase
The memcached open-source distributed in-memory
indexing system has been enhanced by Schooner
Technologies and Membase, to include features
analogous to the other key-value stores: persistence,
replication, high availability, dynamic growth, backup,
and so on. Without persistence or replication,
memcached does not really qualify as a “data store”.
However, Membrain and Membase certainly do, and
these systems are also compatible with existing
memcached applications. This compatibility is an
attractive feature, given that memcached is widely
used; memcached users that require more advanced
features can easily upgrade to Membase and
Membrain.
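Because these systems speak the memcached protocol, an existing client works unchanged. A minimal sketch using the pymemcache client library, assuming a memcached-compatible server (memcached, Membase, or Membrain) on the default port 11211:

from pymemcache.client.base import Client  # a memcached client library

client = Client(("localhost", 11211))

client.set("session:abc123", b"Mr. Smith", expire=300)  # insert with a 5-minute TTL
print(client.get("session:abc123"))                     # b'Mr. Smith'
client.delete("session:abc123")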
The Membase system is open source, and is supported
by the company Membase. Its most attractive feature
is probably its ability to elastically add or remove
servers in a running system, moving data and
dynamically redirecting requests in the meantime. The
elasticity in most of the other systems is not as
convenient.
Membrain is licensed per server, and is supported by
Schooner Technologies. Its most attractive feature is
probably its excellent tuning for flash memory. The
performance gains of flash memory will not be realized in other systems by treating flash as a faster hard disk;
it is important that the system treat flash as a true
“third tier”, different from RAM and disk. For
example, many systems have substantial overhead in
buffering and caching hard disk pages; this is
unnecessary overhead with flash. The benchmark
results on Schooner’s web site show many times better
performance than a number of competitors, particularly
when data overflows RAM.
2.7 Summary
All the key-value stores support insert, delete, and
lookup operations. All of these systems provide
scalability through key distribution over nodes.
Voldemort, Riak, Tokyo Cabinet, and enhanced
memcached systems can store data in RAM or on disk,
with storage add-ons. The others store data in RAM,
and provide disk as backup, or rely on replication and
recovery so that a backup is not needed.
Scalaris and enhanced memcached systems use
synchronous replication, the rest use asynchronous.
Scalaris and Tokyo Cabinet implement transactions,
while the others do not.
Voldemort and Riak use multi-version concurrency
control (MVCC), the others use locks.
Membrain and Membase are built on the popular
memcached system, adding persistence, replication,
and other features. Backward compatibility with memcached gives these products an advantage.
3. DOCUMENT STORES
As discussed in the first section, document stores
support more complex data than the key-value stores.
The term “document store” may be confusing: while
these systems could store “documents” in the
traditional sense (articles, Microsoft Word files, etc.), a
document in these systems can be any kind of
“pointerless object”, consistent with our definition in
Section 1. Unlike the key-value stores, these systems
generally support secondary indexes and multiple
types of documents (objects) per database, and nested
documents or lists. Like other NoSQL systems, the document stores do not provide ACID transactional properties.
References
Chang, F., Dean, J., Ghemawat, S., et al. Bigtable: A Distributed Storage System for Structured Data. Proceedings of OSDI, 2006.
DeCandia, G., Hastorun, D., Jampani, M., et al. Dynamo: Amazon's Highly Available Key-Value Store. Proceedings of SOSP, 2007.
Cooper, B. F., Silberstein, A., Tam, E., Ramakrishnan, R., and Sears, R. Benchmarking Cloud Serving Systems with YCSB. Proceedings of the ACM Symposium on Cloud Computing (SoCC), 2010.
Chang, F., Dean, J., Ghemawat, S., et al. Bigtable: A Distributed Storage System for Structured Data. ACM Transactions on Computer Systems 26(2), 2008.
Gilbert, S., and Lynch, N. Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. ACM SIGACT News 33(2), 2002.