Scalable SQL and NoSQL Data Stores
Rick Cattell
Originally published in 2010, last
revised December 2011
ABSTRACT
In this paper, we examine a number of SQL and so-called “NoSQL” data stores designed to scale simple OLTP-style application loads over many servers.
Originally motivated by Web 2.0 applications, these
systems are designed to scale to thousands or millions
of users doing updates as well as reads, in contrast to
traditional DBMSs and data warehouses. We contrast
the new systems on their data model, consistency
mechanisms, storage mechanisms, durability
guarantees, availability, query support, and other
dimensions. These systems typically sacrifice some of
these dimensions, e.g. database-wide transaction
consistency, in order to achieve others, e.g. higher
availability and scalability.
Note: Bibliographic references for systems are not
listed, but URLs for more information can be found in
the System References table at the end of this paper.
Caveat: Statements in this paper are based on sources
and documentation that may not be reliable, and the
systems described are “moving targets,” so some
statements may be incorrect. Verify through other
sources before depending on information here.
Nevertheless, we hope this comprehensive survey is
useful! Check for future corrections on the author’s
web site cattell.net/datastores.
Disclosure: The author is on the technical advisory
board of Schooner Technologies and has a consulting
business advising on scalable databases.
1. OVERVIEW
In recent years a number of new systems have been
designed to provide good horizontal scalability for
simple read/write database operations distributed over
many servers. In contrast, traditional database
products have comparatively little or no ability to scale
horizontally on these applications. This paper
examines and compares the various new systems.
Many of the new systems are referred to as “NoSQL” data stores. The definition of NoSQL, which stands for “Not Only SQL” or “Not Relational”, is not entirely agreed upon. For the purposes of this paper, NoSQL systems generally have six key features:
1. the ability to horizontally scale “simple
operation” throughput over many servers,
2. the ability to replicate and to distribute (partition)
data over many servers,
3. a simple call level interface or protocol (in
contrast to a SQL binding),
4. a weaker concurrency model than the ACID
transactions of most relational (SQL) database
systems,
5. efficient use of distributed indexes and RAM for
data storage, and
6. the ability to dynamically add new attributes to
data records.
The systems differ in other ways, and in this paper we
contrast those differences. They range in functionality
from the simplest distributed hashing, as supported by
the popular memcached open source cache, to highly
scalable partitioned tables, as supported by Google's BigTable [1]. In fact, BigTable, memcached, and
Amazon’s Dynamo [2] provided a proof of concept
that inspired many of the data stores we describe here:
• Memcached demonstrated that in-memory indexes can be highly scalable, distributing and replicating objects over multiple nodes.
• Dynamo pioneered the idea of eventual consistency as a way to achieve higher availability and scalability: data fetched are not guaranteed to be up-to-date, but updates are guaranteed to be propagated to all nodes eventually.
• BigTable demonstrated that persistent record storage could be scaled to thousands of nodes, a feat that most of the other systems aspire to.
A key feature of NoSQL systems is “shared nothing” horizontal scaling: replicating and partitioning data over many servers. This allows them to support a large number of simple read/write operations per second. This simple operation load is traditionally called OLTP (online transaction processing), but it is also common in modern web applications.
The NoSQL systems described here generally do not
provide ACID transactional properties: updates are
eventually propagated, but there are limited guarantees
on the consistency of reads. Some authors suggest a
“BASE” acronym in contrast to the “ACID” acronym:
BASE = Basically Available, Soft state,
Eventually consistent
ACID = Atomicity, Consistency, Isolation, and
Durability
The idea is that by giving up ACID constraints, one
can achieve much higher performance and scalability.

However, the systems differ in how much they give up.
For example, most of the systems call themselves
“eventually consistent”, meaning that updates are
eventually propagated to all nodes, but many of them
provide mechanisms for some degree of consistency,
such as multi-version concurrency control (MVCC).
Proponents of NoSQL often cite Eric Brewer’s CAP
theorem [4], which states that a system can have only
two out of three of the following properties:
consistency, availability, and partition-tolerance. The
NoSQL systems generally give up consistency.
However, the trade-offs are complex, as we will see.
New relational DBMSs have also been introduced to
provide better horizontal scaling for OLTP, when
compared to traditional RDBMSs. After examining
the NoSQL systems, we will look at these SQL
systems and compare the strengths of the approaches.
The SQL systems strive to provide horizontal
scalability without abandoning SQL and ACID
transactions. We will discuss the trade-offs here.
In this paper, we will refer to both the new SQL and
NoSQL systems as data stores, since the term
“database system” is widely used to refer to traditional
DBMSs. However, we will still use the term
“database” to refer to the stored data in these systems.
All of the data stores have some administrative unit
that you would call a database: data may be stored in
one file, or in a directory, or via some other
mechanism that defines the scope of data used by a
group of applications. Each database is an island unto
itself, even if the database is partitioned and distributed
over multiple machines: there is no “federated
database” concept in these systems (as with some
relational and object-oriented databases), allowing
multiple separately-administered databases to appear
as one. Most of the systems allow horizontal
partitioning of data, storing records on different servers
according to some key; this is called “sharding”. Some
of the systems also allow vertical partitioning, where
parts of a single record are stored on different servers.
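To make the sharding idea concrete, here is a minimal sketch in Python (all names here are hypothetical, not any particular system's API): each record key is hashed, and the hash selects the server that stores the record.

import hashlib

SERVERS = ["server0", "server1", "server2", "server3"]

def shard_for(key):
    # Hash the record key and map it to one of the servers.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SERVERS[h % len(SERVERS)]

print(shard_for("customer:12345"))  # this record always lands on the same server

Note that simple modulo hashing forces most keys to move when a server is added or removed; the consistent hashing used by Voldemort and Riak (Section 2) avoids most of that movement.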
1.1 Scope of this Paper
Before proceeding, some clarification is needed in defining “horizontal scalability” and “simple operations”. These define the focus of this paper.
By “simple operations”, we refer to key lookups, reads
and writes of one record or a small number of records.
This is in contrast to complex queries or joins, read-
mostly access, or other application loads. With the
advent of the web, especially Web 2.0 sites where
millions of users may both read and write data,
scalability for simple database operations has become
more important. For example, applications may search
and update multi-server databases of electronic mail,
personal profiles, web postings, wikis, customer
records, online dating records, classified ads, and many
other kinds of data. These all generally fit the
definition of “simple operation” applications: reading
or writing a small number of related records in each
operation.
The term “horizontal scalability” means the ability to
distribute both the data and the load of these simple
operations over many servers, with no RAM or disk
shared among the servers. Horizontal scaling differs
from “vertical” scaling, where a database system
utilizes many cores and/or CPUs that share RAM and
disks. Some of the systems we describe provide both
vertical and horizontal scalability, and the effective use
of multiple cores is important, but our main focus is on
horizontal scalability, because the number of cores that
can share memory is limited, and horizontal scaling
generally proves less expensive, using commodity
servers. Note that horizontal and vertical partitioning
are not related to horizontal and vertical scaling,
except that they are both useful for horizontal scaling.
1.2 Systems Beyond our Scope
Some authors have used a broad definition of NoSQL,
including any database system that is not relational.
Specifically, they include:
• Graph database systems: Neo4j and OrientDB provide efficient distributed storage and queries of a graph of nodes with references among them.
• Object-oriented database systems: Object-oriented DBMSs (e.g., Versant) also provide efficient distributed storage of a graph of objects, and materialize these objects as programming language objects.
• Distributed object-oriented stores: Very similar to object-oriented DBMSs, systems such as GemFire distribute object graphs in-memory on multiple servers.
These systems are a good choice for applications that
must do fast and extensive reference-following,
especially where data fits in memory. Programming
language integration is also valuable. Unlike the
NoSQL systems, these systems generally provide
ACID transactions. Many of them provide horizontal
scaling for reference-following and distributed query
decomposition, as well. Due to space limitations,
however, we have omitted these systems from our
comparisons. The applications and the necessary
optimizations for scaling for these systems differ from
the systems we cover here, where key lookups and
simple operations predominate over reference-
following and complex object behavior. It is possible
these systems can scale on simple operations as well,
but that is a topic for a future paper, and proof through
benchmarks.

Data warehousing database systems provide horizontal
scaling, but are also beyond the scope of this paper.
Data warehousing applications are different in
important ways:
• They perform complex queries that collect and join information from many different tables.
• The ratio of reads to writes is high: that is, the database is read-only or read-mostly.
There are existing systems for data warehousing that
scale well horizontally. Because the data is
infrequently updated, it is possible to organize or
replicate the database in ways that make scaling
possible.
1.3 Data Model Terminology
Unlike relational (SQL) DBMSs, the terminology used
by NoSQL data stores is often inconsistent. For the
purposes of this paper, we need a consistent way to
compare the data models and functionality.
All of the systems described here provide a way to
store scalar values, like numbers and strings, as well as
BLOBs. Some of them also provide a way to store
more complex nested or reference values. The systems
all store sets of attribute-value pairs, but use different
data structures, specifically:
• A “tuple” is a row in a relational table, where attribute names are pre-defined in a schema, and the values must be scalar. The values are referenced by attribute name, as opposed to an array or list, where they are referenced by ordinal position.
• A “document” allows values to be nested documents or lists as well as scalar values, and the attribute names are dynamically defined for each document at runtime. A document differs from a tuple in that the attributes are not defined in a global schema, and a wider range of values is permitted.
• An “extensible record” is a hybrid between a tuple and a document, where families of attributes are defined in a schema, but new attributes can be added (within an attribute family) on a per-record basis. Attributes may be list-valued.
• An “object” is analogous to an object in programming languages, but without the procedural methods. Values may be references or nested objects.
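To make these four structures concrete, here is an illustrative sketch using Python literals (hypothetical data, not any particular system's API):

# A tuple: attribute names fixed by a schema; values must be scalar.
tuple_row = {"id": 12345, "name": "Mr. Smith", "phone": "415-555-6524"}

# A document: attribute names defined per document; values may nest.
document = {
    "name": "Mr. Smith",
    "orders": [{"sku": "A1", "qty": 2}],  # nested list of documents
    "nickname": "Smitty",                 # attribute other documents may lack
}

# An extensible record: attribute families ("identity", "contact") are in
# the schema, but new attributes can be added within a family per record.
extensible_record = {
    "identity": {"name": "Mr. Smith"},
    "contact": {"phone": "415-555-6524",
                "twitter": "@smith"},     # added at runtime for this record
}

# An object: like a document, but values may be references to other objects.
class Customer:
    def __init__(self, name, sales_rep):
        self.name = name
        self.sales_rep = sales_rep        # reference to another object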
1.4 Data Store Categories
In this paper, the data stores are grouped according to
their data model:
• Key-value Stores: These systems store values and an index to find them, based on a programmer-defined key.
• Document Stores: These systems store documents, as just defined. The documents are indexed and a simple query mechanism is provided.
• Extensible Record Stores: These systems store extensible records that can be partitioned vertically and horizontally across nodes. Some papers call these “wide column stores”.
• Relational Databases: These systems store (and index and query) tuples. The new RDBMSs that provide horizontal scaling are covered in this paper.
Data stores in these four categories are covered in the
next four sections, respectively. We will then
summarize and compare the systems.
2. KEY-VALUE STORES
The simplest data stores use a data model similar to the
popular memcached distributed in-memory cache, with
a single key-value index for all the data. We’ll call
these systems key-value stores. Unlike memcached,
these systems generally provide a persistence
mechanism and additional functionality as well:
replication, versioning, locking, transactions, sorting,
and/or other features. The client interface provides
inserts, deletes, and index lookups. Like memcached,
none of these systems offer secondary indices or keys.
2.1 Project Voldemort
Project Voldemort is an advanced key-value store,
written in Java. It is open source, with substantial
contributions from LinkedIn. Voldemort provides
multi-version concurrency control (MVCC) for
updates. It updates replicas asynchronously, so it does
not guarantee consistent data. However, it can
guarantee an up-to-date view if you read a majority of
replicas.
Voldemort supports optimistic locking for consistent
multi-record updates: if updates conflict with any other
process, they can be backed out. Vector clocks, as
used in Dynamo [3], provide an ordering on versions.
You can also specify which version you want to
update, for the put and delete operations.
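A sketch of this versioned-update idea (hypothetical class and method names, not Voldemort's actual client API): a read returns a value together with its version; a write supplies the version it read, and is rejected if the stored version has moved on in the meantime.

class ObsoleteVersionError(Exception):
    pass

class VersionedStore:
    def __init__(self):
        self._data = {}          # key -> (version, value)

    def get(self, key):
        return self._data.get(key, (0, None))

    def put(self, key, value, expected_version):
        version, _ = self._data.get(key, (0, None))
        if version != expected_version:
            raise ObsoleteVersionError("another process updated %r" % key)
        self._data[key] = (version + 1, value)

store = VersionedStore()
v, _ = store.get("cart:12345")
store.put("cart:12345", ["book"], v)         # succeeds, version advances
store.put("cart:12345", ["book", "cd"], v)   # raises: stale version, back out and retry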
Voldemort supports automatic sharding of data.
Consistent hashing is used to distribute data around a
ring of nodes: data hashed to node K is replicated on nodes K+1 … K+n, where n is the desired number of extra copies (often n=1). Using a good sharding technique, there should be many more “virtual” nodes
than physical nodes (servers). Once data partitioning
is set up, its operation is transparent. Nodes can be
added or removed from a database cluster, and the
system adapts automatically. Voldemort automatically
detects and recovers failed nodes.
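A minimal consistent-hashing sketch along these lines (illustrative Python, not Voldemort's actual code): virtual nodes and keys are hashed onto one ring; a key is stored on the first node clockwise from its hash, with replicas on the next distinct physical nodes.

import bisect, hashlib

def _hash(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=8):
        # Place several "virtual" nodes per physical server on the ring.
        self.points = sorted((_hash("%s#%d" % (n, i)), n)
                             for n in nodes for i in range(vnodes))
        self.hashes = [h for h, _ in self.points]

    def preference_list(self, key, n_extra=1):
        # First node clockwise from the key's hash, plus n_extra successors.
        i = bisect.bisect(self.hashes, _hash(key))
        result = []
        while len(result) < 1 + n_extra:
            node = self.points[i % len(self.points)][1]
            if node not in result:
                result.append(node)
            i += 1
        return result

ring = Ring(["nodeA", "nodeB", "nodeC", "nodeD"])
print(ring.preference_list("customer:12345"))  # primary plus one replica

Because each physical server owns many small arcs of the ring, adding or removing a server moves only the keys on its own arcs, and load spreads evenly across the remaining servers.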

Voldemort can store data in RAM, but it also permits
plugging in a storage engine. In particular, it supports
a Berkeley DB and Random Access File storage
engine. Voldemort supports lists and records in
addition to simple scalar values.
2.2 Riak
Riak is written in Erlang. It was open-sourced by
Basho in mid-2009. Basho alternately describes Riak
as a “key-value store” and “document store”. We will
categorize it as an advanced key-value store here,
because it lacks important features of document stores, but it (and Voldemort) has more functionality than the other key-value stores:
• Riak objects can be fetched and stored in JSON format, and thus can have multiple fields (like documents), and objects can be grouped into buckets, like the collections supported by document stores, with allowed/required fields defined on a per-bucket basis.
• Riak does not support indices on any fields except the primary key. The only thing you can do with the non-primary fields is fetch and store them as part of a JSON object. Riak lacks the query mechanisms of the document stores; the only lookup you can do is on primary key.
Riak supports replication of objects and sharding by
hashing on the primary key. It allows replica values to
be temporarily inconsistent. Consistency is tunable by
specifying how many replicas (on different nodes)
must respond for a successful read and how many must
respond for a successful write. This is per-read and
per-write, so different parts of an application can
choose different trade-offs.
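The standard quorum arithmetic behind this tuning: with N replicas, a write acknowledged by W nodes and a read that consults R nodes are guaranteed to overlap in at least one up-to-date replica whenever R + W > N. A small worked example (a sketch of the rule, not Riak's API):

def read_sees_latest_write(n, w, r):
    # At least one of the R nodes read must be among the W nodes written.
    return r + w > n

print(read_sees_latest_write(3, 2, 2))  # True: reads see the latest write
print(read_sees_latest_write(3, 1, 1))  # False: fast, but reads may be stale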
Like Voldemort, Riak uses a derivative of MVCC
where vector clocks are assigned when values are
updated. Vector clocks can be used to determine when objects are direct descendants of each other or of a common parent, so Riak can often self-repair data that it discovers to be out of sync.
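A minimal vector-clock sketch (illustrative, not Riak's implementation): each clock maps a node id to an update counter; version a descends from version b if a's counter is at least b's for every node, and two versions are concurrent (a conflict to reconcile) if neither descends from the other.

def descends(a, b):
    # True if version a is a descendant of (or equal to) version b.
    return all(a.get(node, 0) >= count for node, count in b.items())

def concurrent(a, b):
    # Neither descends from the other: independent updates to be reconciled.
    return not descends(a, b) and not descends(b, a)

v1 = {"node1": 2, "node2": 1}
v2 = {"node1": 1, "node2": 1}
v3 = {"node1": 1, "node2": 2}
print(descends(v1, v2))    # True: v1 saw everything v2 saw
print(concurrent(v1, v3))  # True: conflicting updates, must be repaired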
The Riak architecture is symmetric and simple. Like
Voldemort, it uses consistent hashing. There is no
distinguished node to track status of the system: the
nodes use a gossip protocol to track who is alive and
who has which data, and any node may service a client
request. Riak also includes a map/reduce mechanism
to split work over all the nodes in a cluster.
The client interface to Riak is based on RESTful HTTP
requests. REST (REpresentational State Transfer) uses
uniform, stateless, cacheable, client-server calls. There
is also a programmatic interface for Erlang, Java, and
other languages.
The storage part of Riak is “pluggable”: the key-value
pairs may be in memory, in ETS tables, in DETS
tables, or in Osmos tables. ETS, DETS, and Osmos
tables are all implemented in Erlang, with different
performance and properties.
One unique feature of Riak is that it can store “links” between objects (documents), for example to link
objects for authors to the objects for the books they
wrote. Links reduce the need for secondary indices,
but there is still no way to do range queries.
Here’s an example of a Riak object described in JSON:
{
  "bucket":"customers",
  "key":"12345",
  "object":{
    "name":"Mr. Smith",
    "phone":"415-555-6524" },
  "links":[
    ["sales","Mr. Salesguy","salesrep"],
    ["cust-orders","12345","orders"] ],
  "vclock":"opaque-riak-vclock",
  "lastmod":"Mon, 03 Aug 2009 18:49:42 GMT"
}
Note that the primary key is distinguished, while other fields are part of an “object” portion. Also note that the bucket, vector clock, and modification date are specified as part of the object, and links to other objects are supported.
2.3 Redis
The Redis key-value data store started as a one-person
project but now has multiple contributors as BSD-
licensed open source. It is written in C.
A Redis server is accessed by a wire protocol
implemented in various client libraries (which must be
updated when the protocol changes). The client side
does the distributed hashing over servers. The servers
store data in RAM, but data can be copied to disk for
backup or system shutdown. System shutdown may be
needed to add more nodes.
Like the other key-value stores, Redis implements
insert, delete and lookup operations. Like Voldemort,
it allows lists and sets to be associated with a key, not
just a blob or string. It also includes list and set
operations.
Redis does atomic updates by locking, and does
asynchronous replication. It is reported to support
about 100K gets/sets per second on an 8-core server.
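A short sketch of these operations using the redis-py client library, assuming a Redis server on localhost at the default port (the key names are made up):

import redis  # the redis-py client library

r = redis.Redis(host="localhost", port=6379)

# Plain key-value operations.
r.set("user:12345:name", "Mr. Smith")
print(r.get("user:12345:name"))           # b'Mr. Smith'

# Lists and sets associated with a key, with their own operations.
r.rpush("user:12345:orders", "order1", "order2")
print(r.lrange("user:12345:orders", 0, -1))
r.sadd("user:12345:tags", "premium", "west-coast")
print(r.smembers("user:12345:tags"))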
2.4 Scalaris
Scalaris is functionally similar to Redis. It was written
in Erlang at the Zuse Institute in Berlin, and is open
source. In distributing data over nodes, it allows key
ranges to be assigned to nodes, rather than simply
hashing to nodes. This means that a query on a range
of values does not need to go to every node, and it also
may allow better load balancing, depending on key
distribution.
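A sketch of the range-partitioning idea (hypothetical layout, not Scalaris's actual code): contiguous key ranges are assigned to nodes, so a range query contacts only the nodes whose ranges overlap it.

import bisect

# Sorted lower bounds: node i owns keys from BOUNDARIES[i] up to BOUNDARIES[i+1].
BOUNDARIES = ["a", "g", "n", "t"]
NODES      = ["node1", "node2", "node3", "node4"]

def node_for(key):
    return NODES[max(bisect.bisect_right(BOUNDARIES, key) - 1, 0)]

def nodes_for_range(lo, hi):
    i = max(bisect.bisect_right(BOUNDARIES, lo) - 1, 0)
    j = max(bisect.bisect_right(BOUNDARIES, hi) - 1, 0)
    return NODES[i:j + 1]

print(node_for("smith"))          # 'node3', which owns keys n <= key < t
print(nodes_for_range("k", "p"))  # ['node2', 'node3']: only two nodes contacted

With hashing, the same range query would have to be broadcast to every node.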

Like the other key-value stores, it supports insert,
delete, and lookup. It does replication synchronously
(copies must be updated before the operation is
complete) so data is guaranteed to be consistent.
Scalaris also supports transactions with ACID
properties on multiple objects. Data is stored in
memory, but replication and recovery from node
failures provides durability of the updates.
Nevertheless, a multi-node power failure would cause
disastrous loss of data, and the virtual memory limit
sets a maximum database size.
Scalaris reads and writes must go to a majority of the
replicas before an operation completes. Scalaris uses a
ring of nodes, an unusual distribution and replication
strategy that requires log(N) hops to read/write a key-
value pair.
2.5 Tokyo Cabinet
Tokyo Cabinet / Tokyo Tyrant was a sourceforge.net project, but is now licensed and maintained by FAL Labs. Tokyo Cabinet is the back-end server; Tokyo Tyrant is a client library for remote access. Both are written in C.
There are six different variations for the Tokyo
Cabinet server: hash indexes in memory or on disk, B-
trees in memory or on disk, fixed-size record tables,
and variable-length record tables. The engines
obviously differ in their performance characteristics,
e.g. the fixed-length records allow quick lookups.
There are slight variations on the API supported by
these engines, but they all support common
get/set/update operations. The documentation is a bit
unclear, but they claim to support locking, ACID
transactions, a binary array data type, and more
complex update operations to atomically update a
number or concatenate to a string. They support
asynchronous replication with dual master or
master/slave. Recovery of a failed node is manual, and
there is no automatic sharding.
2.6 Memcached, Membrain, and
Membase
The memcached open-source distributed in-memory
indexing system has been enhanced by Schooner
Technologies and Membase, to include features
analogous to the other key-value stores: persistence,
replication, high availability, dynamic growth, backup,
and so on. Without persistence or replication,
memcached does not really qualify as a “data store”.
However, Membrain and Membase certainly do, and
these systems are also compatible with existing
memcached applications. This compatibility is an
attractive feature, given that memcached is widely
used; memcached users that require more advanced
features can easily upgrade to Membase and
Membrain.
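Because these systems speak the memcached protocol, an existing client works unchanged. A minimal sketch using the pymemcache client library, assuming a memcached-compatible server (memcached, Membase, or Membrain) on the default port 11211:

from pymemcache.client.base import Client  # a memcached client library

client = Client(("localhost", 11211))

client.set("session:abc123", b"Mr. Smith", expire=300)  # insert with a 5-minute TTL
print(client.get("session:abc123"))                     # b'Mr. Smith'
client.delete("session:abc123")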
The Membase system is open source, and is supported
by the company Membase. Its most attractive feature
is probably its ability to elastically add or remove
servers in a running system, moving data and
dynamically redirecting requests in the meantime. The
elasticity in most of the other systems is not as
convenient.
Membrain is licensed per server, and is supported by
Schooner Technologies. Its most attractive feature is
probably its excellent tuning for flash memory. The
performance gains of flash memory will not be realized in other systems by treating flash as a faster hard disk;
it is important that the system treat flash as a true
“third tier”, different from RAM and disk. For
example, many systems have substantial overhead in
buffering and caching hard disk pages; this is
unnecessary overhead with flash. The benchmark
results on Schooner’s web site show many times better
performance than a number of competitors, particularly
when data overflows RAM.
2.7 Summary
All the key-value stores support insert, delete, and
lookup operations. All of these systems provide
scalability through key distribution over nodes.
Voldemort, Riak, Tokyo Cabinet, and enhanced
memcached systems can store data in RAM or on disk,
with storage add-ons. The others store data in RAM,
and provide disk as backup, or rely on replication and
recovery so that a backup is not needed.
Scalaris and enhanced memcached systems use
synchronous replication, the rest use asynchronous.
Scalaris and Tokyo Cabinet implement transactions,
while the others do not.
Voldemort and Riak use multi-version concurrency
control (MVCC), the others use locks.
Membrain and Membase are built on the popular
memcached system, adding persistence, replication,
and other features. Backward compatibility with memcached gives these products an advantage.
3. DOCUMENT STORES
As discussed in the first section, document stores
support more complex data than the key-value stores.
The term “document store” may be confusing: while
these systems could store “documents” in the
traditional sense (articles, Microsoft Word files, etc.), a
document in these systems can be any kind of
“pointerless object”, consistent with our definition in
Section 1. Unlike the key-value stores, these systems
generally support secondary indexes and multiple
types of documents (objects) per database, and nested
documents or lists. Like other NoSQL systems, the document stores do not provide ACID transactional properties.
References
Chang, F., Dean, J., Ghemawat, S., et al. Bigtable: A Distributed Storage System for Structured Data. Proceedings of OSDI, 2006.
DeCandia, G., Hastorun, D., Jampani, M., et al. Dynamo: Amazon's Highly Available Key-Value Store. Proceedings of SOSP, 2007.
Cooper, B. F., Silberstein, A., Tam, E., Ramakrishnan, R., and Sears, R. Benchmarking Cloud Serving Systems with YCSB. Proceedings of the ACM Symposium on Cloud Computing (SoCC), 2010.
Chang, F., Dean, J., Ghemawat, S., et al. Bigtable: A Distributed Storage System for Structured Data. ACM Transactions on Computer Systems 26(2), 2008.
Gilbert, S., and Lynch, N. Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. ACM SIGACT News 33(2), 2002.