Merge pull request #20 from martinsumner/mas-introduction
Mas introduction
README.md
@@ -1,94 +1,37 @@
# LeveledDB

# LevelEd - An Erlang Key-Value store

## Overview

## Introduction

LeveledDB is an experimental Key/Value store based on the Log-Structured Merge Tree concept, written in Erlang. It is not currently suitable for production systems, but is intended to provide a proof of concept of the potential benefits of different design trade-offs in LSM Trees.

LevelEd is a work-in-progress prototype of a simple Key-Value store based on the concept of Log-Structured Merge Trees, with the following characteristics:
The specific goals of this implementation are:

- Optimised for workloads with larger values (e.g. > 4KB).

- Support for values split between metadata and body, allowing for HEAD requests which have lower overheads than GET requests.

- Support for tagging of object types and the implementation of alternative store behaviour based on type.

- Support for low-cost clones without locking, to provide for scanning queries, specifically where there is a need to scan across keys and metadata (not values).

- Written in Erlang as a message passing system between Actors.

- Be simple and straight-forward to understand and extend.

The store has been developed with a focus on being a potential backend to a Riak KV database, rather than as a generic store. The primary aim of developing (yet another) Riak backend is to examine the potential to reduce the broader costs of providing sustained throughput in Riak, i.e. to provide equivalent throughput on cheaper hardware. It is also anticipated that having a fully-featured pure Erlang backend may assist in evolving new features through the Riak ecosystem which require end-to-end changes.
- Support objects which have keys, secondary indexes, a value and potentially some metadata which provides a useful subset of the information in the value

- Support a HEAD request which has a lower cost than a GET request, so that requests requiring access only to metadata can gain efficiency by saving the full cost of returning the entire value

- Tries to reduce write amplification when compared with LevelDB, to reduce disk contention but also make rsync style backup strategies more efficient

The system context for the store at conception is as a Riak backend store with a complete set of backend capabilities, but one intended to be used with relatively frequent iterators, and values of non-trivial size (e.g. > 4KB).

The store is not expected to be 'faster' than leveldb (the primary fully-featured Riak backend available today), but it is likely in some cases to offer improvements in throughput.
## Implementation

## More Details

The store is written in Erlang using the actor model, the primary actors being:

For more details on the store:

- A Bookie

- An [introduction](docs/INTRO.md) to LevelEd covers some context to the factors motivating design trade-offs in the store.

- An Inker

- A Penciller

- Worker Clerks

- File Clerks

### The Bookie

- The [design overview](docs/DESIGN.md) explains the actor model used and the basic flow of requests through the store.
The Bookie provides the public interface of the store, liaising with the Inker and the Penciller to resolve requests to put new objects, and fetch those objects. The Bookie keeps a copy of key changes and object metadata associated with recent modifications, but otherwise has no direct access to state within the store. The Bookie can replicate the Penciller and the Inker to provide clones of the store. These clones can be used for querying across the store at a given snapshot.

- [Future work](docs/FUTURE.md) covers new features being implemented at present, and improvements necessary to make the system production ready.

### The Inker

- There is also a ["Why"](docs/WHY.md) section looking at lower level design choices and the rationale that supports them.
The Inker is responsible for keeping the Journal of all changes which have been made to the store, with new writes being appended to the end of the latest journal file. The Journal is an ordered log of activity by sequence number.

## Is this interesting?

Changes to the store should be acknowledged if and only if they have been persisted to the Journal. The Inker can find a value in the Journal through a manifest which provides a map between sequence numbers and Journal files. The Inker can only efficiently find a value in the store if the sequence number is known, and so the sequence number is always part of the metadata maintained by the Penciller in the Ledger.

At the initiation of the project I accepted that making a positive contribution to this space is hard - given the superior brainpower and experience of those that have contributed to the KV store problem space in general, and the Riak backend space in particular.

The Inker can also scan the Journal from a particular sequence number, for example to recover the Penciller's lost in-memory state following a shutdown.

The target at inception was to do something interesting, something that articulates through working software the potential for improvement to exist by re-thinking certain key assumptions and trade-offs.
### The Penciller

[Initial volume tests](docs/VOLUME.md) indicate that it is at least interesting, with substantial improvements in both throughput (73%) and tail latency (1:20) when compared to eleveldb - in scenarios expected to be optimised for leveled. Note, to be clear, this statement falls well short of making a general claim that it represents a 'better' Riak backend than leveldb.

The Penciller is responsible for maintaining a Ledger of Keys, Index entries and Metadata (including the sequence number) that represent a near-real-time view of the contents of the store. The Ledger is a merge tree ordered into Levels of exponentially increasing size, with each level being ordered across files and within files by Key. Get requests are handled by checking each level in turn - from the top (Level 0), to the basement (up to Level 8). The first match for a given key is the returned answer.

More information can be found in the [volume testing section](docs/VOLUME.md).

Changes ripple down the levels in batches and require frequent rewriting of files, in particular at higher levels. As the Ledger does not contain the full object values, this write amplification associated with the flow down the levels is limited to the size of the key and metadata.
The Penciller keeps an in-memory view of new changes that have yet to be persisted in the Ledger, and at startup can request the Inker to replay any missing changes by scanning the Journal.

### Worker Clerks

Both the Inker and the Penciller must undertake compaction work. The Inker must garbage collect replaced or deleted objects from the Journal. The Penciller must merge files down the tree to free up capacity for new writes at the top of the Ledger.

Both the Penciller and the Inker make use of their own dedicated clerk for completing this work. The Clerk will add all new files necessary to represent the new view of that part of the store, and then update the Inker/Penciller with the new manifest that represents that view. Once the update has been acknowledged, any removed files can be marked as delete_pending, and they will poll the Inker (if a Journal file) or Penciller (if a Ledger file) for confirmation that no users of the system still depend on the old snapshot of the store being maintained.

### File Clerks
Every file within the store is owned by its own dedicated process (modelled as a finite state machine). Files are never created or accessed directly by the Inker or the Penciller; interactions with the files are managed through messages sent to the File Clerk processes which own the files.

The File Clerks themselves are ignorant of their context within the store. For example, a file in the Ledger does not know what level of the Tree it resides in. The state of the store is represented by the Manifest, which maintains a picture of the store and contains the process IDs of the file clerks which represent the files.

Cloning of the store does not require any file-system level activity - a clone simply needs to know the manifest so that it can independently make requests of the File Clerk processes, and register itself with the Inker/Penciller so that those files are not deleted whilst the clone is active.

The Journal files use a constant database format almost exactly replicating the CDB format originally designed by DJ Bernstein. The Ledger files use a bespoke format which is based on Google's SST format, with the primary difference being that the bloom filters used to protect against unnecessary lookups are based on the Riak Segment IDs of the key, and use single-hash rice-encoded sets rather than using the traditional bloom filter size-optimisation model of extending the number of hashes used to reduce the false-positive rate.

File clerks spend a short initial portion of their life in a writable state. Once they have left the writing state, they will, for the remainder of their life-cycle, be in an immutable read-only state.
## Paths

The PUT path for new objects and object changes depends on the Bookie interacting with the Inker to ensure that the change has been persisted in the Journal; the Ledger is updated in batches after the PUT has been completed.

The HEAD path requires the Bookie to look in its cache of recent Ledger changes, and if the change is not present, to consult the Penciller.

The GET path follows the HEAD path, but once the sequence number has been determined through the response from the Ledger, the object itself is fetched from the Journal via the Inker.

All other queries (folds over indexes, keys and objects) are managed by cloning either the Penciller, or the Penciller and the Inker.
## Trade-Offs

Further information on specific design trade-off decisions is provided:

- What is a log-structured merge tree?

- Memory management

- Backup and Recovery

- The Penciller memory

- File formats

- Stalling, pausing and back-pressure

- Riak Anti-Entropy

- Riak and HEAD requests

- Riak and alternative queries
## Naming Things is Hard

The naming of actors within the model is very loosely based on the slang associated with an on-course Bookmaker.

## Learning

The project was started in part as a learning exercise. This is my first Erlang project, and has been used to try and familiarise myself with Erlang concepts. However, there are undoubtedly many lessons still to be learned about how to write good Erlang OTP applications.
docs/DESIGN.md (new file)
@@ -0,0 +1,78 @@
## Design

The store is written in Erlang using the actor model, the primary actors being:

- A Bookie

- An Inker

- A Penciller

- Worker Clerks

- File Clerks

### The Bookie

The Bookie provides the public interface of the store, liaising with the Inker and the Penciller to resolve requests to put new objects, and fetch those objects. The Bookie keeps a copy of key changes and object metadata associated with recent modifications, but otherwise has no direct access to state within the store. The Bookie can replicate the Penciller and the Inker to provide clones of the store. These clones can be used for querying across the store at a given snapshot.

![](pics/leveled_intro.png "Introduction")
### The Inker

The Inker is responsible for keeping the Journal of all changes which have been made to the store, with new writes being appended to the end of the latest journal file. The Journal is an ordered log of activity by sequence number.

Changes to the store should be acknowledged if and only if they have been persisted to the Journal. The Inker can find a value in the Journal through a manifest which provides a map between sequence numbers and Journal files. The Inker can only efficiently find a value in the store if the sequence number is known.

The Inker can also scan the Journal from a particular sequence number, for example to recover in-memory state within the store following a shutdown.

![](pics/inker.png "Inker")
### The Penciller

The Penciller is responsible for maintaining a Ledger of Keys, Index entries and Metadata (including the sequence number) that represent a near-real-time view of the contents of the store. The Ledger is a merge tree ordered into Levels of exponentially increasing size, with each level being ordered across files and within files by Key. Get requests are handled by checking each level in turn - from the top (Level 0), to the basement (up to Level 8). The first match for a given key is the returned answer.

Changes ripple down the levels in batches and require frequent rewriting of files, in particular at higher levels. As the Ledger does not contain the full object values, this write amplification associated with the flow down the levels is limited to the size of the key and metadata.

The Penciller keeps an in-memory view of new changes that have yet to be persisted in the Ledger, and at startup can request the Inker to replay any missing changes by scanning the Journal.

![](pics/penciller.png "Penciller")
### Worker Clerks

Both the Inker and the Penciller must undertake compaction work. The Inker must garbage collect replaced or deleted objects from the Journal. The Penciller must merge files down the tree to free up capacity for new writes at the top of the Ledger.

Both the Penciller and the Inker make use of their own dedicated clerk for completing this work. The Clerk will add all new files necessary to represent the new view of that part of the store, and then update the Inker/Penciller with the new manifest that represents that view. Once the update has been acknowledged, any removed files can be marked as delete_pending, and they will poll the Inker (if a Journal file) or Penciller (if a Ledger file) for confirmation that no clones of the system still depend on the old snapshot of the store being maintained.
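A rough sketch of that delete_pending confirmation follows: the owner checks its register of snapshots and only confirms deletion once no live snapshot could still reference the file. The shape of the snapshot register and the function name are assumptions made purely for illustration.

```erlang
%% Sketch of the owner-side check made when a delete_pending file polls
%% its Inker/Penciller. A file can only be removed once no registered
%% snapshot is based on a manifest version that still includes it.
%% The names and the shape of the snapshot register are illustrative.
-module(delete_pending_sketch).
-export([confirm_delete/2]).

%% Snapshots :: [{SnapshotPid, SnapshotManifestSQN, ExpiryTime}]
%% FileRemovedAtSQN :: the manifest sequence number at which the file
%% was removed from the current manifest.
confirm_delete(FileRemovedAtSQN, Snapshots) ->
    Now = erlang:monotonic_time(second),
    StillNeeded =
        lists:any(
            fun({_Pid, SnapSQN, Expiry}) ->
                %% A snapshot blocks deletion only if it pre-dates the
                %% removal of the file and has not yet expired.
                SnapSQN < FileRemovedAtSQN andalso Expiry > Now
            end,
            Snapshots),
    not StillNeeded.
```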
### File Clerks

Every file within the store is owned by its own dedicated process (modelled as a finite state machine). Files are never created or accessed directly by the Inker or the Penciller; interactions with the files are managed through messages sent to the File Clerk processes which own the files.

The File Clerks themselves are ignorant of their context within the store. For example, a file in the Ledger does not know what level of the Tree it resides in. The state of the store is represented by the Manifest, which maintains a picture of the store and contains the process IDs of the file clerks which represent the files.

Cloning of the store does not require any file-system level activity - a clone simply needs to know the manifest so that it can independently make requests of the File Clerk processes, and register itself with the Inker/Penciller so that those files are not deleted whilst the clone is active.

The Journal files use a constant database format almost exactly replicating the CDB format originally designed by DJ Bernstein. The Ledger files use a bespoke format which is based on Google's SST format.

File clerks spend a short initial portion of their life in a writable state. Once they have left the writing state, they will, for the remainder of their life-cycle, be in an immutable read-only state.

## Clones
Both the Penciller and the Inker can be cloned, to provide a snapshot of the database at a point in time for a long-running process that can be run concurrently with other database actions. Clones are used for Journal compaction, but also for scanning queries in the Penciller (for example to support 2i queries or hashtree rebuilds in Riak).

The snapshot process is simple. The necessary loop-state is requested from the real worker, in particular the manifest and any immutable in-memory caches, and a new gen_server worker is started with that loop state. The clone registers itself as a snapshot with the real worker, with a timeout that will allow the snapshot to expire if the clone silently terminates. The clone will then perform its work, making requests to the file workers referred to in the manifest. Once the work is complete the clone should remove itself from the snapshot register in the real worker before closing.
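As a sketch of that lifecycle, the clone below requests the loop-state from the real worker, registers itself with a timeout, runs its query against the shared manifest and then deregisters. The messages exchanged (clone_state, {register_snapshot, ...} and so on) are illustrative assumptions, not the store's actual protocol.

```erlang
%% Sketch of the clone/snapshot lifecycle described above. The messages
%% exchanged with the real worker are illustrative assumptions.
-module(clone_sketch).
-export([run_query/3]).

run_query(RealWorker, QueryFun, TimeoutSeconds) ->
    %% Fetch the immutable loop-state (manifest plus in-memory caches).
    {ok, Manifest, Caches} =
        gen_server:call(RealWorker, clone_state, infinity),
    %% Register so that files in the manifest are not deleted under us;
    %% the timeout lets the registration expire if this process dies.
    ok = gen_server:call(RealWorker,
                         {register_snapshot, self(), TimeoutSeconds}),
    try
        %% Perform the long-running work against the shared manifest.
        QueryFun(Manifest, Caches)
    after
        %% Always deregister, even if the query crashes.
        gen_server:call(RealWorker, {release_snapshot, self()})
    end.
```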
The snapshot register is used by the real worker when file workers are placed in the delete_pending state after they have been replaced in the current manifest. Checking the list of registered snapshots allows the Penciller or Inker to tell the File Clerk whether it may remove itself permanently - that is, whether any remaining clones (other than those which have timed out) still expect the file to be present.

Managing the system in this way requires that ETS tables are used sparingly for holding in-memory state, with immutable and hence shareable objects used instead. The exception to the rule is the small in-memory state of recent additions kept by the Bookie - which must be exported to a list on every snapshot request.

## Paths
The PUT path for new objects and object changes depends on the Bookie interacting with the Inker to ensure that the change has been persisted in the Journal; the Ledger is updated in batches after the PUT has been completed.

The HEAD path requires the Bookie to look in its cache of recent Ledger changes, and if the change is not present, to consult the Penciller.

The GET path follows the HEAD path, but once the sequence number has been determined through the response from the Ledger, the object itself is fetched from the Journal via the Inker.
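Putting the paths together, a GET is in effect a HEAD that then uses the sequence number returned from the Ledger to fetch the value from the Journal. The sketch below shows that composition; head_lookup/2 and journal_fetch/2 are hypothetical helpers, not the store's real functions.

```erlang
%% Sketch of the GET path: a HEAD-style Ledger lookup to find the sequence
%% number, followed by a Journal fetch via the Inker. head_lookup/2 and
%% journal_fetch/2 are illustrative placeholders.
-module(get_path_sketch).
-export([get_object/3]).

get_object(Penciller, Inker, Key) ->
    case head_lookup(Penciller, Key) of
        {ok, SQN, Metadata} ->
            {ok, Value} = journal_fetch(Inker, SQN),
            {ok, Metadata, Value};
        not_present ->
            not_present
    end.

head_lookup(Penciller, Key) ->
    gen_server:call(Penciller, {fetch, Key}, infinity).

journal_fetch(Inker, SQN) ->
    gen_server:call(Inker, {fetch_sqn, SQN}, infinity).
```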
All other queries (folds over indexes, keys and objects) are managed by cloning either the Penciller, or the Penciller and the Inker.
docs/FUTURE.md (new file)
@@ -0,0 +1,32 @@
## Further Features

The store supports all the required Riak backend capabilities. A number of further features are either partially included, in progress or under consideration:

- Support for HEAD operations, and changes to GET/PUT Riak paths to use HEAD requests where GET requests would be unnecessary.

- Support for "Tags" in keys to allow for different treatment of different key types (i.e. changes in merge actions, metadata retention, caching for lookups).

- A fast "list bucket" capability, borrowing from the HanoiDB implementation, to allow for safe bucket listing in production.

- A bucket size query, which requires traversal only of the Ledger and counts the keys and sums the total on-disk size of the objects within the bucket.

- Support for a specific Riak tombstone tag where reaping of tombstones can be deferred (by many days) i.e. so that a 'sort-of-keep' deletion strategy can be followed that will eventually garbage collect without the need to hold pending full deletion state in memory.

## Outstanding work

There is some work required before LevelEd could be considered production ready:

- A strategy for the supervision and restart of processes, in particular for clerks.

- Further functional testing within the context of Riak.

- Introduction of property-based testing.

- Riak modifications to support the optimised Key/Clock scanning for hashtree rebuilds.

- Amend compaction scheduling to ensure that all vnodes do not try to concurrently compact during a single window.

- Improved handling of corrupted files.

- A way of identifying the partition in each log to ease the difficulty of tracing activity when multiple stores are run in parallel.
docs/INTRO.md (new file)
@@ -0,0 +1,104 @@
# LevelEd - A Log-Structured Merge Tree

The following section is a brief overview of some of the motivating factors behind developing LevelEd.

## A Paper To Love

The concept of a Log Structured Merge Tree is described within the 1996 paper ["The Log Structured Merge Tree"](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.44.2782&rep=rep1&type=pdf) by Patrick O'Neil et al. The paper is not specific on precisely how an LSM-Tree should be implemented, proposing a series of potential options. The paper's focus is on framing the justification for design decisions in the context of hardware economics.

The hardware economics at the time of the paper were:

- COST<sub>d</sub> = cost of 1MByte of disk storage = $1

- COST<sub>m</sub> = cost of 1MByte of memory storage = $100

- COST<sub>P</sub> = cost of 1 page/second IO rate random pages = $25

- COST<sub>pi</sub> = cost of 1 page/second IO rate multi-page block = $2.50

Consideration of design trade-offs for LevelEd should likewise start with a viewpoint on the modern cost of hardware - in the context where we expect the store to be run (i.e. as multiple stores co-existing on a server, sharing workload across multiple servers).
### Modern Hardware Costs

Based on the experience of running Riak at scale in production for the NHS, what has been noticeable is that although around 50% of the cost of the servers used goes on disk-related expenditure, in almost all scenarios where the system is pushed to its limits the first limit hit is disk throughput.

The purchase costs of disks, though, do not accurately reflect the running costs of disks - because disks fail, and fail often. Also, the hardware choices made by the NHS for the Spine programme do not necessarily reflect the generic industry choices and costs.

To get an up-to-date objective measure of what the overall costs are, assuming data centres are run with a high degree of efficiency and savings of scale, the price list for Amazon EC2 instances can assist, by directly comparing the current prices of different instances.

As at 26th January 2017, the [on-demand instance pricing](https://aws.amazon.com/ec2/pricing/on-demand/) for servers is:

- c4.4xlarge - 16 CPU, 30 GB RAM, EBS only - $0.796

- r4.xlarge - 4 CPU, 30GB RAM, EBS only - $0.266

- r4.4xlarge - 16 CPU, 122 GB RAM, EBS only - $1.064

- i2.4xlarge - 16 CPU, 122 GB RAM, 4 X 800 GB SSD - $3.41

- d2.4xlarge - 16 CPU, 122 GB RAM, 12 X 2000 GB HDD - $2.76
By comparing these prices we can surmise that the relative costs are:

- 1 CPU - $0.044 per hour

- 1GB RAM - $0.0029 per hour

- 1GB SSD - $0.0015 per hour (assumes mirroring)

- 1GB HDD - $0.00015 per hour (assumes mirroring)

If a natural ratio in a database server is 1 CPU: 10GB RAM: 200GB disk - this would give a proportional cost for the disk of 80% for SSD and 25% for HDD.

Compared to the figures at the time of the LSM-Tree paper, the actual delta between the per-byte cost of memory and the per-byte cost of disk space has closed significantly, even when using the lowest cost disk options. This may reflect changes in the pace of technology advancement, or just the fact that the maintenance cost associated with different failure rates is now more correctly priced.

The availability of SSDs is not a silver bullet to disk I/O problems when cost is considered: although they eliminate the additional cost of random page access by removing the disk head movement overhead (of about 6.5ms per shift), this benefit comes at an order of magnitude difference in cost compared to spinning disks, and at a cost greater than half the price of DRAM. SSDs have not taken the problem of managing the overheads of disk persistence away, they've simply added another dimension to the economic profiling problem.

In physical on-premise server environments there is also commonly the cost of disk controllers. Disk controllers bend the economics of persistence through the presence of flash-backed write caches. However, disk controllers also fail - within the NHS environment disk controller failures are the second most common device failure after individual disks. Failures of disk controllers are also expensive to resolve, not being hot-pluggable like disks, and carrying greater risk of node data-loss due to either bad luck or bad process during the change. It is noticeable that EC2 does not have disk controllers, and given their failure rate and cost of recovery, this appears to be a sensible trade-off.

Making cost-driven decisions about storage design remains as relevant now as it was two decades ago when the LSM-Tree paper was published, especially as we can now directly see those costs reflected in hourly resource charges.
### eleveldb Evolution

The evolution of leveldb in Riak, from the original Google-provided store to the significantly refactored eleveldb provided by Basho, reflects the alternative hardware economics that the stores were designed for.

The original leveldb considered in part the hardware economics of the phone, where there are clear constraints around CPU usage - due to both form-factor and battery life - and where disk space may be at a greater premium than disk IOPS. Some of the evolution of eleveldb is down to the Riak-specific problem of needing to run multiple stores on a single server, where even load distribution may lead to a synchronisation of activity. Much of the evolution is also about how to make better use of the continuous availability of CPU resource, in the face of the relative scarcity of disk resource. Changes such as overlapping files at level 1, hot threads, compression improvements etc. all move eleveldb in the direction of being easier on disk at the cost of CPU; and the hardware economics of servers would indicate this is a wise choice.

### Planning for LevelEd

The primary design differentiation between LevelEd and LevelDB is the separation of the key store (known as the Ledger in LevelEd) and the value store (known as the Journal). The Journal is like a continuous extension of the nursery log within LevelDB, only with a gradual evolution into [CDB files](https://en.wikipedia.org/wiki/Cdb_(software)) so that file offset pointers are not required to exist permanently in memory. The Ledger is a merge tree structure, with values substituted with metadata and a sequence number - where the sequence number can be used to find the value in the Journal.
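To make the separation concrete, the sketch below shows one plausible shape for a Ledger entry and its corresponding Journal entry: the Ledger carries only the key, sequence number, index changes and metadata, while the Journal carries the value keyed by sequence number. The record fields are illustrative assumptions, not the actual on-disk format.

```erlang
%% Illustrative sketch of the Ledger/Journal split. The record fields are
%% assumptions for this example, not the actual on-disk format.
-record(ledger_entry,
        {key        :: term(),        % {Bucket, Key} style object key
         sqn        :: pos_integer(), % sequence number locating the value
         index_spec :: list(),        % secondary index changes
         metadata   :: term()}).      % e.g. vector clock, hash, size

-record(journal_entry,
        {sqn   :: pos_integer(),      % Journal is ordered by this
         key   :: term(),
         value :: binary()}).         % the full object body

%% A HEAD request can be answered from the #ledger_entry{} alone; a GET
%% additionally uses sqn to locate the #journal_entry{} holding the value.
```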
This is not an original idea; the LSM-Tree paper specifically talked about the trade-offs of placing identifiers rather than values in the merge tree:

> To begin with, it should be clear that the LSM-tree entries could themselves contain records rather than RIDs pointing to records elsewhere on disk. This means that the records themselves can be clustered by their keyvalue. The cost for this is larger entries and a concomitant acceleration of the rate of insert R in bytes per second and therefore of cursor movement and total I/O rate H.

The reasoning behind the use of this structure is an attempt to differentiate more clearly between a (small) hot database space (the Ledger) and a (much larger) cold database space (the non-current part of the Journal) so that through use of the page cache, or faster disk, the hot part of the database can be optimised for rapid access.

In parallel to this work, there has also been work published on [WiscKey](https://www.usenix.org/system/files/conference/fast16/fast16-papers-lu.pdf) which explores precisely this trade-off.

There is an additional optimisation that is directly relevant to Riak. Riak always fetches both Key and Value as a GET operation within a cluster, but theoretically in many cases, where the object is not a sibling, this is not necessary. For example, when GETting an object to perform a PUT, only the vector clock and index specs are actually necessary if the object is not a CRDT. Also, when performing a Riak GET operation the value is fetched three times, even if there is no conflict between the values.

So the hypothesis that separating Keys and Values may be optimal for LSM-Trees in general is potentially extendable for Riak, where there exists the potential in the majority of read requests to replace the GET of a value with a lower cost HEAD request for just Key and Metadata.
## Being Operator Friendly

The LSM-Tree paper focuses on hardware trade-offs in database design. LevelEd is focused on the job of being a backend to a Riak database, and the Riak database is opinionated on the trade-off between developer and operator productivity. Running a Riak database imposes constraints and demands on developers - there are things the developer needs to think hard about: living without transactions, considering the resolution of siblings, manual modelling for query optimisation.

However, in return for this pain there is great reward, a reward which is gifted to the operators of the service. Riak clusters are reliable and predictable, and operational processes are slim and straightforward - preparation for managing a Riak cluster in production needn't go much beyond rehearsing the magical cure-alls of the node stop/start and node join/leave processes. At the NHS, where we have more than 50 Riak nodes in 24 by 365 business critical operations, it is not untypical to go more than 28 days without anyone logging on to a database node. This is a relief for those of us who have previously lived in the world of databases with endless configuration parameters to test or blame for issues, where you always seem to be the unlucky one who suffers the outages "never seen in any other customer", where the databases come with ever more complicated infrastructure dependencies and where DBAs need to be constantly at hand to analyse reports, kill rogue queries and re-run the query optimiser as an answer to the latest 'blip' in performance.

Developments on Riak over the past few years, in particular the introduction of CRDTs, have made some limited progress in easing the developer headaches. No real progress has been made though in making things more operator friendly, and although operator sleep patterns are the primary beneficiary of a Riak installation, that does not mean to say that things cannot be improved.

### Planning For LevelEd

The primary operator improvements sought are:

- Increased visibility of database contents. Riak lacks efficient answers to simple questions about the bucket names which have been defined and the size and space consumed by different buckets.

- Reduced variation. Riak has unscheduled events, in particular active anti-entropy hashtree rebuilds, that will temporarily impact the performance of the cluster both during the event (due to resource pressure of the actual rebuild) and immediately after (e.g. due to page cache pollution). The intention is to convert object-scanning events into events which require only Key/Metadata scanning.

- More predictable capacity management. Production systems constrained by disk throughput are hard to monitor for capacity, especially as there is no readily available measure of disk utilisation (note the often missed warning in the iostat man page - 'Device saturation occurs when this value is close to 100% for devices serving requests serially. But for devices serving requests in parallel, such as RAID arrays and modern SSDs, this number does not reflect their performance limits'). Monitoring disk-bound *nix-based systems requires the monitoring of volatile late-indicators of issues (e.g. await times), and this can be exacerbated by volatile demands on disk (e.g. due to compaction), and the ongoing risk of either individual disk failure or the overhead of individual disk recovery.

- Code context switching. Looking at code can be helpful; having the majority of the code in Erlang and a portion of the code in C++ makes understanding and monitoring behaviour within the database that bit harder.

- Backup space issues. Managing the space consumed by backups, and the time required to produce backups, is traditionally handled through the use of delta-based backup systems such as rsync. However, such systems are far less effective if the system to be backed up has write amplification by design.
docs/VOLUME.md (new file)
@@ -0,0 +1,94 @@
# Volume Testing

## Parallel Node Testing

Initial volume tests have been [based on the standard basho_bench eleveldb test](../test/volume/single_node/examples), running multiple stores in parallel on the same node and subjecting them to concurrent pressure.

This showed a [relatively positive performance for leveled](VOLUME_PRERIAK.md) for both population and load. It also showed that although the LevelEd throughput was relatively stable, it was still subject to fluctuations related to CPU constraints. Prior to moving on to full Riak testing, a number of changes were then made to LevelEd to reduce the CPU load, in particular during merge events.

## Riak Cluster Test - 1

The first test on a Riak cluster was based on the following configuration (a sketch of a matching basho_bench configuration follows the list):
- A 5 node cluster

- Using i2.2xlarge EC2 nodes with mirrored SSD drives (for data partition only)

- noop scheduler, transparent huge pages disabled, ext4 partition

- A 64 vnode ring-size

- 45 concurrent basho_bench threads (basho_bench run on separate disks) running at max

- AAE set to passive

- sync writes enabled (on both backends)

- An object size of 8KB

- A pareto distribution of requests with a keyspace of 50M keys

- 5 GETs for each UPDATE

- 4 hour test run
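The sketch below shows roughly how those workload parameters might be expressed in a basho_bench configuration file. It is illustrative only - the driver, IP addresses and exact generator settings are assumptions rather than the configuration actually used for the test.

```erlang
%% Illustrative basho_bench configuration matching the parameters above.
%% The driver, IPs and generator choices are assumptions for this sketch.
{mode, max}.                           % run at maximum rate
{duration, 240}.                       % 4 hour test run (minutes)
{concurrent, 45}.                      % 45 concurrent worker threads
{driver, basho_bench_driver_riakc_pb}. % Riak protocol buffers driver
{riakc_pb_ips, [{10,0,0,1}, {10,0,0,2}, {10,0,0,3}, {10,0,0,4}, {10,0,0,5}]}.
{key_generator, {pareto_int, 50000000}}.   % pareto distribution, 50M keyspace
{value_generator, {fixed_bin, 8192}}.      % 8KB objects
{operations, [{get, 5}, {update, 1}]}.     % 5 GETs for each UPDATE
```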
This test showed a <b>73.9%</b> improvement in throughput when using LevelEd, but more importantly a huge improvement in the variance of tail latency. Through the course of the test, the averages of the maximum response times (in each 10s period) were:

leveled GET mean(max) | eleveldb GET mean(max)
:-------------------------:|:-------------------------:
21.7ms | 410.2ms

leveled PUT mean(max) | eleveldb PUT mean(max)
:-------------------------:|:-------------------------:
101.5ms | 2,301.6ms

Tail latency under load in leveled is less than 5% of the comparable value in eleveldb (note there is a significant difference in the y-axis scale between the latency charts on these graphs).

leveled Results | eleveldb Results
:-------------------------:|:-------------------------:
![](../test/volume/cluster_one/output/summary_leveled_5n_45t.png "LevelEd") | ![](../test/volume/cluster_one/output/summary_leveldb_5n_45t.png "LevelDB")
### Lies, damned lies etc

To a certain extent this should not be too unexpected - leveled is designed to reduce write amplification, and without that write amplification the persisted write load gives leveled an advantage. The frequent periods of poor performance in leveldb appear to coincide with periods of very high await times on nodes during merge jobs, which may involve up to o(1GB) of write activity.

Also, the 5:1 ratio of GET:UPDATE is not quite that, as:

- each UPDATE requires an external Riak GET (as well as the internal GETs);

- the empty nature of the database at the test start means that there are no actual value fetches initially (just not-present responses) and only 50% of fetches get a value by the end of the test (much less for leveldb as there is less volume PUT during the test).

When testing on a single node cluster (with a smaller ring size, and a smaller keyspace) the relative benefit of leveled appears to be much smaller. One big difference between the single node and multi-node testing undertaken is that between the tests the disk was switched from using a single drive to using a mirrored pair. It is suspected that the amplified improvement between single-node tests and multi-node tests is related in part to the cost of software-based mirroring exaggerating write contention on the disk.

Leveldb achieved more work in a sense during the test, as the test was run outside of the compaction window for leveled - so the on-disk size of the leveled store was higher, as no replaced values had been compacted. Test 6 below will examine the impact of the compaction window on throughput.

On the flip side, it could be argued that the 73% difference under-estimates the difference, in that the greater volume achieved meant that by the back-end of the test more GETs were requiring fetches in leveled when compared with leveldb. Because of this, by the last ten minutes of the test, the throughput advantage had been reduced to <b>55.4%</b>.
## Riak Cluster Test - 2

to be completed ..

As above, but on d2.2xlarge EC2 nodes for HDD comparison

## Riak Cluster Test - 3

to be completed ..

Testing with optimised GET FSM (which checks HEAD requests from r nodes, and only sends a GET request to one node if no sibling resolution is required)

## Riak Cluster Test - 4

to be completed ..

Testing with optimised PUT FSM (switching from GET before PUT to HEAD before PUT)

## Riak Cluster Test - 5

to be completed ..

Testing with changed hashtree logic in Riak so key/clock scan is effective

## Riak Cluster Test - 6

to be completed ..

Testing during a journal compaction window

## Riak Cluster Test - 7

to be completed ..

Testing for load including 2i queries
docs/VOLUME_PRERIAK.md (new file)
@@ -0,0 +1,22 @@
# Volume Testing

## Parallel Node Testing - Non-Riak

Initial volume tests have been [based on the standard basho_bench eleveldb test](../test/volume/single_node/examples), running multiple stores in parallel on the same node and subjecting them to concurrent pressure.

This showed a relatively positive performance for leveled for both population and load.

Populate leveled | Populate eleveldb
:-------------------------:|:-------------------------:
![](../test/volume/single_node/output/leveled_pop.png "LevelEd") | ![](../test/volume/single_node/output/leveldb_pop.png "LevelDB")

Load leveled | Load eleveldb
:-------------------------:|:-------------------------:
![](../test/volume/single_node/output/leveled_load.png "LevelEd") | ![](../test/volume/single_node/output/leveldb_load.png "LevelDB")

This test was a positive comparison for LevelEd, but also showed that although the LevelEd throughput was relatively stable it was still subject to fluctuations related to CPU constraints. Prior to moving on to full Riak testing, a number of changes were then made to LevelEd to reduce the CPU load, in particular during merge events.

The eleveldb results may not be a fair representation of performance in that:

- Within Riak it was to be expected that eleveldb would start with an appropriate default configuration that might better utilise available memory.

- The test node used had a single desktop (SSD-based) drive, and the 'low' points of eleveldb performance were associated with disk constraints. Running on more genuine servers with high-performance disks may give better performance.
docs/WHY.md (new file)
@@ -0,0 +1,49 @@
# Why? Why? Why? Why? Why?

## Why not just use RocksDB?

Well that wouldn't be interesting.

All LSM-trees which evolve off the back of leveldb are trying to improve leveldb in a particular context. I'm considering larger values, with a need for iterators and object time-to-lives, optimisations by supporting HEAD requests, and also the problem of running multiple isolated nodes in parallel on a single server.

I suspect that the [context of what I'm trying to achieve](https://github.com/basho/riak_kv/issues/1033) matters as much as other factors. I [could be wrong](https://github.com/basho/riak/issues/756).

## Why not use memory-mapped files?

It is hard not to be impressed with the performance results from LMDB; is memory-mapping magic?

I tried hard to fully understand memory-mapping before I started, and ultimately wasn't clear that I fully understood it. However, the following things made me consider that memory-mapping may not make a fundamental difference in this context:

- I didn't understand how a swap to user space can be avoided if the persisted information is compressed, and wanted the full value of compression in the page cache.

- Most memory map performance results tend to have Keys and Values of sizes convenient to memory pages and/or CPU cache lines, but in leveled I'm specifically looking at objects of unpredictable and inconvenient size (normally > 4KB).

- Objects are ultimately processed as Erlang terms of some sort; again it is not obvious how that could be serialised and de-serialised without losing the advantage of memory-mapping.

- The one process per file design of leveled makes database cloning easy, but eliminates some advantages of memory mapping.
## Why Erlang?

Honestly, writing stuff that consistently behaves predictably and safely is hard, and achieving this in C or C++ would be beyond my skills. However, further to the [introduction](INTRO.md), CPU is not a constraint I'm interested in, as this is a plentiful, cheap resource. Neither is low median latency a design consideration. Writing this in another language may result in faster code, but would that be faster in a way that would be material in this context?

How to approach taking a snapshot of a database for queries, how to run background compaction/merge work in parallel to real activity - these are non-trivial challenges, and the programming model of Erlang/OTP makes this a lot, lot easier. The end intention is to include this in Riak KV, and hence reducing serialisation steps, human context-switching and managing reductions correctly means that there may be some distinct advantages in using Erlang.

Theory of constraints guides me to consider optimisation only of the bottleneck, and the bottleneck appears to be access to disk. A [futile fortnight](https://github.com/martinsumner/eleveleddb/tree/mas-nifile/src) trying to use a NIF for file read/write performance improvements convinced me that Erlang does this natively with reasonable efficiency.

## Why not cache objects in memory intelligently?

The eleveldb smart caching of objects in memory is impressive, in particular in how it auto-adjusts the amount of memory used to fit with the number of vnodes currently running on the physical server.

From the start I decided that fadvise would be my friend, in part as:

- It was easier to depend on the operating system to understand and handle LRU management.

- I'm expecting various persisted elements (not just values, but also key ranges) to benefit from compression, so want to maximise the volume of data holdable in memory (by caching the compressed view in the page cache).

- I expect to reduce the page-cache polluting events that can occur in Riak by reducing object scanning.

Ultimately though, sophisticated memory management is hard, and beyond my capability in the timescale available.

## Why make this backwards compatible with OTP16?

Yes why, why do I have to do this?
BIN docs/pics/inker.png (new file, 429 KiB)
BIN docs/pics/leveled_intro.png (new file, 235 KiB)
BIN docs/pics/penciller.png (new file, 284 KiB)
BIN test/volume/cluster_one/output/summary_leveldb_5n_45t.png (new file, 117 KiB)
BIN test/volume/cluster_one/output/summary_leveled_5n_45t.png (new file, 93 KiB)
Four further image files moved, with sizes unchanged (316 KiB, 315 KiB, 333 KiB, 274 KiB).