LevelEd is a work-in-progress prototype of a simple Key-Value store based on the concept of Log-Structured Merge Trees, with the following characteristics:

- Optimised for workloads with larger values (e.g. > 4KB).

- Explicitly supports HEAD requests in addition to GET requests.

- Splits the storage of each value between key/metadata and body, allowing HEAD requests to have lower overheads than GET requests, and allowing queries which traverse keys/metadata to be supported with fewer side effects on the page cache.

- Support for tagging of object types and the implementation of alternative store behaviour based on type.

- Potentially usable for objects with special retention or merge properties.

- Support for low-cost clones without locking to provide for scanning queries.

- Low cost specifically where there is a need to scan across keys and metadata (not values).

- Written in Erlang as a message passing system between Actors.

The store has been developed with a focus on being a potential backend to a Riak KV database, rather than as a generic store.

The primary aim of developing (yet another) Riak backend is to examine the potential to reduce the broader costs of providing sustained throughput in Riak, i.e. to provide equivalent throughput on cheaper hardware. It is also anticipated that having a fully-featured pure Erlang backend may assist in evolving new features through the Riak ecosystem which require end-to-end changes, rather than requiring context switching between C++ and Erlang based components.

The store is not expected to offer lower median latency than leveldb (the primary fully-featured Riak backend available today), but it is likely in some cases to offer improvements in throughput.

## More Details

The target at inception was to do something interesting, something that articulates through working software the potential for improvement to exist by re-thinking certain key assumptions and trade-offs.

[Initial volume tests](docs/VOLUME.md) indicate that it is at least interesting, with substantial improvements in both throughput (73%) and tail latency (1:20) when compared to eleveldb - when using non-trivial object sizes. The largest improvement was with syncing to disk enabled on solid-state drives, but improvement has also been discovered with this object size without sync being enabled, both on SSDs and traditional hard-disk drives.

The hope is that LevelEd may be able to support generally more stable and predictable throughput with larger object sizes, especially with larger key-spaces. More importantly, in the scenarios tested the constraint on throughput is more consistently CPU-based, and not disk-based. This potentially makes capacity planning simpler, and opens up the possibility of scaling out to equivalent throughput at much lower cost (as CPU is relatively low cost when compared to disk space at high I/O) - [offering better alignment between resource constraints and the cost of resource](docs/INTRO.md).

More information can be found in the [volume testing section](docs/VOLUME.md).

Both the Penciller and the Inker each make use of their own dedicated clerk for completing this work. The Clerk will add all new files necessary to represent the new view of that part of the store, and then update the Inker/Penciller with the new manifest that represents that view. Once the update has been acknowledged, any removed files can be marked as delete_pending, and they will poll the Inker (if a Journal file) or Penciller (if a Ledger file) for it to confirm that no clones of the system still depend on the old snapshot of the store to be maintained.

The Penciller's Clerk's work to compact the Ledger is effectively continuous: it will regularly poll the Penciller for compaction work, and do work whenever work is necessary. The Inker's Clerk's compaction responsibility is expected to be scheduled. The Inker's Clerk will need to determine if compaction work is necessary when scheduled, and it does this by asking each CDB file for a random sample of keys and object sizes; each file is then scored by the Clerk for compaction based on the indication from the sample of the likely proportion of space that will be recovered. The Inker's Clerk will compact only the most rewarding 'run' of journal files, and only if the most rewarding run surpasses a threshold score.

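The sample-then-score approach described above can be sketched as follows. This is an illustrative Python sketch, not the actual leveled (Erlang) code: the function names, the shape of the sample, and the run-selection logic are all assumptions made for illustration.

```python
def score_journal_file(sample):
    """Estimate the proportion of a journal file's space that compaction
    would recover, from a random sample of (key, object_size, is_current)
    entries reported by the CDB file."""
    sampled_bytes = sum(size for _, size, _ in sample)
    stale_bytes = sum(size for _, size, current in sample if not current)
    if sampled_bytes == 0:
        return 0.0
    # Extrapolate the stale proportion seen in the sample to the whole file.
    return stale_bytes / sampled_bytes

def best_run(scores, run_length, threshold):
    """Pick the most rewarding contiguous run of journal files, compacting
    only if its mean score surpasses the threshold."""
    runs = [scores[i:i + run_length] for i in range(len(scores) - run_length + 1)]
    best = max(runs, key=lambda r: sum(r) / len(r))
    return best if sum(best) / len(best) >= threshold else None
```

For example, a run scoring well below the threshold would return `None`, deferring compaction until a more rewarding run emerges.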
### File Clerks

Every file within the store has its own dedicated process (modelled as a finite state machine). Files are never created or accessed by the Inker or the Penciller; interactions with the files are managed through messages sent to the File Clerk processes which own the files.

The File Clerks themselves are ignorant of their context within the store. For example, a file in the Ledger does not know what level of the Tree it resides in. The state of the store is represented by the Manifest, which maintains a picture of the store and contains the process IDs of the File Clerks which represent the files.

## Clones

Both the Penciller and the Inker can be cloned, to provide a snapshot of the database at a point in time. A long-running process may then use this clone to query the database concurrently with other database actions. Clones are used for Journal compaction, but also for scanning queries in the Penciller (for example to support 2i queries or hashtree rebuilds in Riak).

The snapshot process is simple. The necessary loop-state is requested from the real worker, in particular the manifest and any immutable in-memory caches, and a new gen_server worker is started with the loop state. The clone registers itself as a snapshot with the real worker, with a timeout that will allow the snapshot to expire if the clone silently terminates. The clone will then perform its work, making requests to the file workers referred to in the manifest. Once the work is complete the clone should remove itself from the snapshot register in the real worker before closing.

The snapshot register is used by the real worker when file workers are placed in the delete_pending state. File processes enter this state when they have been replaced in the current manifest, but access may still be needed by a cloned process using an older version of the manifest. The file process should poll the Penciller/Inker in this state to check if deletion can be completed, and the Penciller/Inker should check the register of snapshots to confirm that no active snapshot has a potential dependency on the file before responding to proceed with deletion.

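The interplay between snapshot registration, timeouts, and delete_pending polling can be sketched roughly as below. This is a hedged Python illustration only: the class and method names are invented, and the real store tracks more state (such as which manifest each clone holds) than this minimal register does.

```python
import time

class SnapshotRegister:
    """Illustrative register of active clones, used to decide whether a
    delete_pending file may be removed."""

    def __init__(self):
        self._snapshots = {}  # clone id -> expiry timestamp

    def register(self, clone_id, timeout_s, now=None):
        # A clone registers with a timeout, so that a silently terminated
        # clone does not block deletion forever.
        now = time.time() if now is None else now
        self._snapshots[clone_id] = now + timeout_s

    def release(self, clone_id):
        # A well-behaved clone removes itself before closing.
        self._snapshots.pop(clone_id, None)

    def confirm_delete(self, now=None):
        """Answer a delete_pending file's poll: deletion may proceed only
        when no live (non-expired) snapshot might still reference it."""
        now = time.time() if now is None else now
        self._snapshots = {c: t for c, t in self._snapshots.items() if t > now}
        return not self._snapshots
```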
Managing the system in this way requires that ETS tables are used sparingly for holding in-memory state, and immutable and hence shareable objects are used instead. The exception to the rule is the small in-memory state of recent additions kept by the Bookie - which must be exported to a list on every snapshot request.

## Types

Objects in LevelEd all have a type (or tag). The tag determines how the object will be compacted in both the Ledger and Journal, and what conversions are applied when fetching the object or when splitting the object metadata to convert from the Inker to the Journal.

Three types are initially supported:

- o - standard object

- o_rkv - riak object

- i - index term

All Ledger Keys created for any type must be 4-tuples starting with the tag. Abstraction with regards to types is currently imperfect, but the expectation is that these types will make support for application specific behaviours easier to achieve, such as behaviours which may be required to support different [CRDTs](https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type).

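A minimal sketch of such tagged 4-tuple keys, in Python rather than Erlang: only "a 4-tuple starting with the tag" comes from the text above, so the remaining element names (bucket, key, sub-key) are assumptions for illustration.

```python
TAGS = {"o", "o_rkv", "i"}  # standard object, riak object, index term

def ledger_key(tag, bucket, key, sub_key=None):
    """Build an illustrative 4-tuple Ledger Key; the first element is
    always the tag, so store behaviour can dispatch on it."""
    if tag not in TAGS:
        raise ValueError("unknown tag: %r" % (tag,))
    return (tag, bucket, key, sub_key)

def tag_of(lkey):
    # Type-specific behaviour (merge, retention, conversion) can key off
    # the first tuple element.
    return lkey[0]
```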
## GET/PUT Paths

The PUT path for new objects and object changes depends on the Bookie interacting with the Inker to ensure that the change has been persisted with the Journal; the Ledger is updated in batches after the PUT has been completed.

All other queries (folds over indexes, keys and objects) are managed by cloning either the Penciller, or the Penciller and the Inker.

## Startup

On startup the Bookie will start, and first prompt the Inker to start up. The Inker will load the manifest and attempt to start a process for each file in the manifest, and then update the manifest with those process IDs. The Inker will also look on disk for any CDB files more recent than the manifest, and will open those at the head of the manifest. Once Inker startup is complete, the Bookie will prompt the Penciller to start up.

The Penciller will, as with the Inker, open the manifest and then start a process for each file in the manifest to open those files. The Penciller will also look for a persisted level-zero file which is more recent than the current manifest and open a process for this too. As part of this process the Penciller determines the highest sequence number that has been seen in the persisted Ledger. Once the highest sequence number has been determined, the Penciller will request a reload from that sequence number.

To complete the reload request, the Inker will scan all the Keys/Values in the Journal from that sequence number (in batches), converting the Keys and Values into the necessary Ledger changes, and loading those changes into the Ledger via the Penciller.

Once the Ledger is up-to-date with the very latest changes in the Journal, the Bookie signals that startup is complete and it is ready to receive load.

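The reload step above can be sketched as the following Python fragment. It is an assumption-laden illustration, not leveled's code: the tuple shapes, the batch size, and the `metadata_of` conversion are all invented stand-ins for the real Journal-to-Ledger conversion performed via the Penciller.

```python
def metadata_of(value):
    # Placeholder conversion: only key/metadata reaches the Ledger,
    # never the full body.
    return {"size": len(value)}

def reload_ledger(journal, ledger, batch_size=32):
    """Replay Journal entries newer than the highest sequence number
    already persisted in the Ledger, applying them in batches."""
    highest_sqn = max((sqn for sqn, _, _ in ledger), default=0)
    batch = []
    for sqn, key, value in journal:
        if sqn <= highest_sqn:
            continue  # already reflected in the persisted Ledger
        batch.append((sqn, key, metadata_of(value)))
        if len(batch) == batch_size:
            ledger.extend(batch)
            batch = []
    ledger.extend(batch)  # flush any partial final batch
    return ledger
```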
## Recovery

Theoretically, the entire Ledger can be restored from the Journal, if the Ledger were erased and then the Bookie was started as normal. This was intended as a potential simplified backup solution, as this would require only the slowly mutating Journal to be backed up, and so would be suited to an incremental delta-based backup strategy and restoring to points in time. There are some potential issues with this:

- It is not clear how long scanning over the Journal would take with an empty ledger, and it would certainly be too long for the standard Riak vnode start timeout.

- Some changes build over history, and so would be lost if earlier changes are lost due to compaction of out-of-date values (e.g. 2i index changes in Riak).

- Journal compaction becomes more complicated if it is necessary to recover in a scenario where the head of the Journal has been corrupted (so if we have B.0, B.1 and B.2 as three versions of the object, if B.2 is lost due to corruption and B.1 is compacted then B.0 will be resurrected on recovery - so post compaction there is no point in time recovery possible).

Three potential recovery strategies are supported to provide some flexibility for dealing with this:

- recovr - if a value is out of date at the time of compaction, remove it. This strategy assumes anomalies can be recovered by a global anti-entropy strategy, or that the persisted part of the Ledger is never lost (the Inker's Clerk will never compact a value received after the highest persisted sequence number in the Ledger).

- retain - on compaction KeyDeltas are retained in the Journal, only values are removed.

- recalc (not yet implemented) - the compaction rules assume that on recovery the key changes will be recalculated by comparing the change with the current database state.

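The three strategies above differ only in how an out-of-date Journal entry is treated at compaction time, which can be sketched as follows. The entry shape and field names here are invented for illustration; only the per-strategy behaviour, and the rule that values above the highest persisted sequence number are never compacted, come from the text.

```python
def compact_entry(entry, strategy, highest_persisted_sqn):
    """Decide what survives compaction for a superseded Journal entry,
    under each of the three recovery strategies."""
    sqn, key, value, key_delta = entry
    if sqn > highest_persisted_sqn:
        # Never compact a value received after the highest persisted
        # sequence number in the Ledger.
        return entry
    if strategy == "recovr":
        return None  # remove entirely; rely on anti-entropy on recovery
    if strategy == "retain":
        return (sqn, key, None, key_delta)  # drop the value, keep KeyDeltas
    if strategy == "recalc":
        # Not yet implemented: on recovery, key changes would be
        # recalculated against the current database state.
        raise NotImplementedError("recalc")
    raise ValueError("unknown strategy: %r" % (strategy,))
```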
The store supports all the required Riak backend capabilities. A number of further features are under consideration:

- Support for "Tags" in keys to allow for different treatment of different key types (i.e. changes in merge actions, metadata retention, caching for lookups).

- Support for Sub-Keys in objects. There may be the potential to use this along with key types to support alternative approaches to CRDT sets or local rather than global secondary indexes.

- A fast "list bucket" capability, borrowing from the HanoiDB implementation, to allow for safe bucket listing in production.

- A bucket size query, which requires traversal only of the Ledger and counts the keys and sums the total on-disk size of the objects within the bucket.

- A PUT flag that allows a truer meaning to be given to the existing DW setting, allowing flushing to disk to be prompted by the need to meet the DW flag, rather than always/never.

- Support for a specific Riak tombstone tag where reaping of tombstones can be deferred (by many days) i.e. so that a 'sort-of-keep' deletion strategy can be followed that will eventually garbage collect without the need to hold pending full deletion state in memory.

- The potential to support Map/Reduce functions on object metadata not values, so that cache-pollution and disk i/o overheads of full object Map/Reduce can be eliminated by smarter object construction strategies that promote information designed for queries from values into metadata.

## Outstanding work

There is some work required before LevelEd could be considered production ready:

- Improved handling of corrupted files.

- A way of identifying the partition in each log to ease the difficulty of tracing activity when multiple stores are run in parallel.

## Riak Features Implemented

The following Riak features have been implemented:

### LevelEd Backend

Branch: [mas-leveleddb](https://github.com/martinsumner/riak_kv/tree/mas-leveleddb)

Description:

The leveled backend has been implemented with some basic manual functional tests. The backend has the following capabilities:

- async_fold - can return a Folder() function in response to 2i queries;

- indexes - support for secondary indexes (currently being returned in the wrong order);

- head - supports a HEAD as well as a GET request;

- hash_query - can return keys/hashes when rebuilding hashtrees, so the implementation of hashtree can call this rather than folding over objects and hashing the objects returned from the fold.

All standard features should (in theory) be supportable (e.g. TTL, M/R, term_regex etc) but aren't subject to end-to-end testing at present.

There is a uses_r_object capability that leveled should in theory be able to support (making it unnecessary to convert to/from binary form). However, given that only the yessir backend currently supports this, the feature hasn't been implemented in case there are potential issues lurking within Riak when using this capability.

### Fast List-Buckets

Normally Basho advise not to list buckets in production - as on other backends it is a hugely expensive operation. Accidentally running list buckets can lead to significant resource pressure, and also rapid consumption of available disk space.

The LevelEd backend borrows from the hanoidb implementation of list buckets, which supports listing buckets with a simple low-cost async fold. So there should be no issue listing buckets in production with these backends.

Note - the technique would work in leveldb and memory backends as well (and perhaps bitcask), it just isn't implemented there. Listing buckets should not really be an issue.

### GET_FSM -> Using HEAD

Branch: [mas-leveled-getfsm](https://github.com/martinsumner/riak_kv/tree/mas-leveled-getfsm)

Description:

In standard Riak the Riak node that receives a GET request starts a riak_kv_get_fsm to handle that request. This FSM goes through the following primary states:

- prepare (get preflists etc)

- validate (make sure there is sufficient vnode availability to handle the request)

- execute (send the GET request to n vnodes)

- waiting_vnode_r (loop waiting for the response from r vnodes, every time a response is received check if enough vnodes have responded, and then either finalise a response or wait for more)

- waiting_read_repair (read repair if a responding vnode is out of date)

The alternative FSM in this branch makes the following changes:

- sends HEAD not GET requests at the execute stage

- when 'enough' is received in waiting_vnode_r, elects a vnode or vnodes from which to fetch the body

- recalls execute, asking for just the necessary GET requests

- in waiting_vnode_r the second time, updates HEAD responses to GET responses, and recalculates the winner (including body) when all vnodes which have been sent a GET request have responded

So rather than doing three Key/Metadata/Body backend lookups for every request, normally (assuming no siblings) Riak now requires three Key/Metadata and one Key/Metadata/Body lookup. For larger bodies, this then avoids the additional disk activity and network load associated with fetching and transmitting the body three times for every request. There are some other side effects:

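The saving described above can be made concrete with a small back-of-envelope calculation. The function below is purely illustrative (the 8KB body and 256-byte metadata figures are assumptions chosen to match the object sizes discussed elsewhere in this document, not measured values):

```python
def bytes_fetched(n, body_size, metadata_size, use_head_first):
    """Compare backend fetch volume for one read (no siblings): the
    standard n-GET flow versus the HEAD-first flow."""
    if use_head_first:
        # n HEAD responses (metadata only) plus one full body fetch.
        return n * metadata_size + (metadata_size + body_size)
    # n full Key/Metadata/Body fetches.
    return n * (metadata_size + body_size)

# With n=3, an 8KB body and 256 bytes of metadata, the HEAD-first flow
# fetches roughly a third of the bytes of the standard flow.
standard = bytes_fetched(3, 8192, 256, use_head_first=False)
head_first = bytes_fetched(3, 8192, 256, use_head_first=True)
```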
- It is more likely to detect out-of-date objects in slow nodes (as the n-r responses may still be received and added to the result set when waiting for the GET response in the second loop).

- The database is in-theory much less likely to have [TCP Incast](http://www.snookles.com/slf-blog/2012/01/05/tcp-incast-what-is-it/) issues, hence reducing the cost of network infrastructure, and reducing the risk of running Riak in shared infrastructure environments.

The feature will not at present work safely with legacy vclocks. This branch generally relies on vector clock comparison only for equality checking, and removes some of the relatively expensive whole body equality tests (either as a result of set:from_list/1 or riak_object:equal/2), which are believed to be a legacy of issues with pre-dvv clocks.

In tests, the benefit of this may not be that significant - as the primary resource saved is disk/network, and so if these are not the resources under pressure, the gain may not be significant. In tests bound by CPU not disks, only a 10% improvement has so far been measured with this feature.

## A Paper To Love

The concept of a Log Structured Merge Tree is described within the 1996 paper ["The Log Structured Merge Tree"](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.44.2782&rep=rep1&type=pdf) by Patrick O'Neil et al. The paper is not specific on precisely how an LSM-Tree should be implemented, proposing a series of potential options. The paper's focus is on framing the justification for design decisions in the context of hardware economics.

The hardware economics at the time of the paper were:

The purchase costs of disks, though, do not accurately reflect the running costs of disks - because disks fail, and fail often. Also the hardware choices made by the NHS for the Spine programme do not necessarily reflect the generic industry choices and costs.

To get an up-to-date and objective measure of what the overall costs are, the Amazon price list can assist if we assume their data centres are managed with a high degree of efficiency due to their scale. Assumptions on the costs of individual components can be made by examining differences between specific instance prices.

As at 26th January 2017, the [on-demand instance pricing](https://aws.amazon.com/ec2/pricing/on-demand/) for servers is:

- c4.4xlarge - 16 CPU, 30 GB RAM, EBS only - $0.796

- r4.xlarge - 4 CPU, 30 GB RAM, EBS only - $0.266

- r4.4xlarge - 16 CPU, 122 GB RAM, EBS only - $1.064

The availability of SSDs is not a silver bullet to disk i/o problems when cost is considered, as although they eliminate the additional costs of random page access through the removal of the disk head movement overhead (of about 6.5ms per shift), this benefit comes at an order of magnitude difference in cost compared to spinning disks, and at a cost greater than half the price of DRAM. SSDs have not taken the problem of managing the overheads of disk persistence away; they've simply added another dimension to the economic profiling problem.

In physical on-premise server environments there is also commonly the cost of disk controllers. Disk controllers bend the economics of persistence through the presence of flash-backed write caches. However, disk controllers also fail - within the NHS environment disk controller failures are the second most common device failure after individual disks. Failures of disk controllers are also expensive to resolve, not being hot-pluggable like disks, and carrying greater risk of node data-loss due to either bad luck or bad process during the change.

|
It is noticeable that EC2 does not have disk controllers and given their failure rate and cost of recovery, this appears to be a sensible trade-off. However, software-only RAID has drawbacks, include the time to setup RAID (24 hours on a d2.2xlarge node), recover from a disk failure and the time to run [scheduled checks](https://www.thomas-krenn.com/en/wiki/Mdadm_checkarray).
Making cost-driven decisions about storage design remains as relevant now as it was two decades ago when the LSM-Tree paper was published, especially as we can now directly see those costs reflected in hourly resource charges.
The first test on a Riak Cluster has been based on the following configuration:
- Using i2.2xlarge EC2 nodes with mirrored SSD drives (for data partition only)
- noop scheduler, transparent huge pages disabled, ext4 partition
- A 64 vnode ring-size
- 45 concurrent basho_bench threads (basho_bench run on a separate machine) running at max
- AAE set to passive
- sync writes enabled (on both backends)
- An object size of 8KB
- A pareto distribution of requests with a keyspace of 200M keys
- 5 GETs for each UPDATE
- 4 hour test run
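The pareto request distribution above means a small fraction of the 200M keys receives most of the traffic. The test itself used basho_bench's own `pareto_int` key generator, whose parameters differ; the sketch below is only an illustration (with an assumed shape parameter) of how a pareto-skewed key picker concentrates load on a hot subset of the keyspace:

```python
import random

KEYSPACE = 200_000_000  # keyspace size used in the test
ALPHA = 1.16            # assumed shape parameter, purely illustrative

def pareto_key(keyspace=KEYSPACE, alpha=ALPHA):
    """Map a pareto-distributed draw onto the keyspace, so low-numbered
    keys are requested far more often than high-numbered ones."""
    draw = random.paretovariate(alpha)  # returns values >= 1.0 with a heavy tail
    return min(int((draw - 1.0) * keyspace * 0.01), keyspace - 1)

random.seed(42)
keys = [pareto_key() for _ in range(100_000)]
# share of requests landing in the "hottest" 20% of the keyspace
hot = sum(1 for k in keys if k < KEYSPACE // 5)
print(f"{hot / len(keys):.0%} of requests hit 20% of the keyspace")
```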
leveled PUT mean(max) | eleveldb PUT mean(max)
:-------------------------:|:-------------------------:
101.5ms | 2,301.6ms
Tail latency under load in leveled is less than 5% of the comparable value in eleveldb.
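Using the PUT mean(max) figures above, the "less than 5%" claim can be checked with simple arithmetic:

```python
leveled_put_max_ms = 101.5     # leveled PUT mean(max) from the table above
eleveldb_put_max_ms = 2301.6   # eleveldb PUT mean(max) from the table above

ratio = leveled_put_max_ms / eleveldb_put_max_ms
print(f"leveled tail latency is {ratio:.1%} of eleveldb's")  # about 4.4%
```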
leveled Results | eleveldb Results
:-------------------------:|:-------------------------:
 | 
### Lies, damned lies etc
## Riak Cluster Test - 2
An identical test was run as above, but with d2.2xlarge instances, so that performance on spinning disks could be compared. Throughput however collapsed with sync_on_write enabled, regardless of whether leveled or leveldb was the backend - with just 1,000 transactions per second supported.
Although append_only writes are being used, almost every write still requires a disk head movement, even if the server could serve all reads from an in-memory cache (as there are normally more vnodes on the server than there are disk heads). It is clear that without a Flash-Backed Write Cache, spinning disks are unusable as the sole storage mechanism.
The same d2.2xlarge cluster was also tested without sync_on_write. Results were:
leveled Results | eleveldb Results
:-------------------------:|:-------------------------:
 | 
This test showed a <b>26.7%</b> improvement in throughput when using leveled. The improvement in tail latency in this test had leveled at about <b>25%</b> of the tail latency of leveldb - still a substantial improvement, although not of the same order of magnitude as in the i2.2xlarge test with sync_on_write enabled.
This indicates that without sync_on_write, there remains potential value in using HDD drives with either leveldb or leveled. It seems reasonably likely (although it remains unproven) that spinning drives with a hardware RAID and flash-backed write cache may perform adequately even with sync on writes.
## Riak Cluster Test - 3
Testing with an optimised GET FSM (which makes HEAD requests to r nodes, and a GET request to only one node if no sibling resolution is required).
The test has been based on the following configuration:
- A 5 node cluster
- Using i2.2xlarge EC2 nodes with mirrored SSD drives (for data partition only)
- deadline scheduler, transparent huge pages disabled, ext4 partition
- A 64 vnode ring-size
- 100 concurrent basho_bench threads (basho_bench run on a separate machine) running at max
- AAE set to passive
- sync writes disabled (on both backends)
- An object size of 8KB
- A pareto distribution of requests with a keyspace of 200M keys
- 5 GETs for each UPDATE
- 6 hour test run
Note the changes from the first cluster test:
- The recommended disk scheduler was used (the change was mistakenly omitted in the previous test).
- Sync_writes were disabled (for comparison with the previous test; Basho have stated that they don't expect customers to use this feature, although some do).
- More basho_bench threads were used to push the extra capacity created by removing the syncing to disk.
- The test run was extended to six hours to confirm a change in the divergence between leveled and leveldb which occurred as the test progressed.
This test showed a <b>12.5%</b> improvement in throughput when using LevelEd, across the whole test. The difference between the backends expands as the test progresses though, so in the last hour of the test there was a <b>29.5%</b> improvement in throughput when using LevelEd.
There was again a significant improvement in variance in tail latency, although not as large a difference as seen when testing with sync writes enabled. Through the course of the test the averages of the maximum response times (in each 10s period) were:
leveled GET mean(max) | eleveldb GET mean(max)
:-------------------------:|:-------------------------:
250.3ms | 1,090.1ms
leveled PUT mean(max) | eleveldb PUT mean(max)
:-------------------------:|:-------------------------:
369.0ms | 1,817.3ms
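For comparison with the earlier sync_on_write test (where leveled's PUT tail latency was under 5% of eleveldb's), the same arithmetic on the figures above gives ratios of roughly a fifth to a quarter:

```python
# mean(max) response times from the six-hour test above, in milliseconds
get_ms = {"leveled": 250.3, "eleveldb": 1090.1}
put_ms = {"leveled": 369.0, "eleveldb": 1817.3}

get_ratio = get_ms["leveled"] / get_ms["eleveldb"]
put_ratio = put_ms["leveled"] / put_ms["eleveldb"]
print(f"GET: {get_ratio:.1%}, PUT: {put_ratio:.1%}")  # about 23% and 20%
```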
The difference in volatility is clear when graphing the test performance:
leveled Results | eleveldb Results
:-------------------------:|:-------------------------:
 | 
Repeating this test over 8 hours with a smaller keyspace and a uniform (not pareto) key distribution yields similar results. This change in test parameters was done to reduce the percentage of not_founds towards the end of the test (down to about 20%).
leveled Results | eleveldb Results
:-------------------------:|:-------------------------:
 | 
### Lies, damned lies etc
Monitoring the resource utilisation between the two tests makes the difference between the two backends clearer. Throughout the test run with the leveled backend, each node is almost exclusively constrained by CPU, with disk utilisation staying well within limits.
The leveldb backend can achieve higher throughput than leveled when it is constrained by CPU, but as the test runs it becomes constrained by disk busyness with increasing frequency. When leveldb is disk-bound, throughput becomes significantly constrained even without disk syncing enabled (i.e. throughput can halve or double in adjoining 20s measurement periods).
Leveled is consistently constrained by one factor, giving more predictable throughput. Leveldb performs well when constrained by CPU availability, but is much more demanding on disk, and when disk becomes the constraint throughput becomes volatile.
## Riak Cluster Test - 4
# docs/WHY.md
## Why not just use RocksDB?
Well that wouldn't have been as interesting.
All LSM-trees which evolve off the back of leveldb are trying to improve leveldb in a particular context. I'm considering larger values, with need for iterators and object time-to-lives, optimisations by supporting HEAD requests and also the problem of running multiple isolated nodes in parallel on a single server.
From the start I decided that fadvise would be my friend, in part as:
- It was easier to depend on the operating system to understand and handle LRU management.
- I'm expecting various persisted elements (not just values, but also key ranges) to benefit from compression, so want to maximise the volume of data which can be held in memory (by caching the compressed view in the page cache).
- I expect to reduce the page-cache polluting events that can occur in Riak by reducing object scanning.
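leveled itself is Erlang, but the underlying mechanism being relied on here is the POSIX `fadvise` call. As a minimal, hypothetical sketch (the file and the choice of advice are illustrative, not leveled's actual behaviour), an application can hint that a scanned file's pages should not pollute the page cache:

```python
import os
import tempfile

# Hypothetical illustration: write a "journal-like" file, then hint to the
# OS that its cached pages will not be needed again soon, so a scan of it
# does not evict hotter data (keys/metadata) from the page cache.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"x" * 65536)
    os.fsync(fd)                     # pages must be clean before they can be dropped
    if hasattr(os, "posix_fadvise"):  # POSIX platforms only
        # offset=0, length=0 means "the whole file"
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
finally:
    os.close(fd)
    os.remove(path)
print("fadvise hint issued" if hasattr(os, "posix_fadvise") else "fadvise unavailable")
```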
Ultimately though, sophisticated memory management is hard, and beyond my capability in the timescale available.
The design may make some caching strategies relatively easy to implement in the future though. Each file process has its own LoopData, and to that LoopData independent caches can be added. This is currently used for caching bloom filters and hash index tables, but could be used in a more flexible way.
## Why do the Sequence Numbers not wrap?
Each update increments the sequence number, but the sequence number does not wrap around at any stage; it will grow forever. The reasoning behind not supporting some sort of wraparound and vacuuming of the sequence number is:
- There is no optimisation based on fitting into a specific size, so no immediate performance benefit from wrapping;
- Auto-vacuuming is hard to achieve, especially without creating a novel and unexpected operational event that is hard to capacity plan for;
- The standard Riak node transfer feature will reset the sequence number in the replacement vnode - providing a natural way to address any unexpected issue with the growth in this integer.
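A rough headroom calculation shows why growth of the integer is not a practical concern (the sustained update rate is an assumption, set well above anything seen in the volume tests; Erlang integers are arbitrary-precision in any case, so even this 64-bit bound does not really apply):

```python
updates_per_second = 50_000          # assumed sustained rate, deliberately generous
seconds_per_year = 60 * 60 * 24 * 365

# How long before even a fixed 64-bit counter would overflow at that rate
years_to_overflow = 2**63 / (updates_per_second * seconds_per_year)
print(f"A 64-bit counter at {updates_per_second}/s lasts ~{years_to_overflow:,.0f} years")
```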
## Why not store pointers in the Ledger?
The Ledger value contains the sequence number which is necessary to look up the value in the Journal. However, this lookup requires a tree to be walked to find the file, and then a two-stage hashtree lookup to find the pointer within the file. The location of the value is known at the time the Ledger entry is written - so why not include a pointer at this stage? The Journal would then be Bitcask-like, relying on external pointers.
The issue with this mechanism would be the added complexity in journal compaction. When the journal is compacted it would be necessary to make a series of Ledger updates to reflect the changes in pointers - and this would require some sort of locking strategy to prevent a race between an external value update and an update to the pointer.
The [WiscKey store](https://www.usenix.org/system/files/conference/fast16/fast16-papers-lu.pdf) achieves this, without being explicit about whether it blocks updates to avoid the race.
The trade-off chosen is that the operational simplicity of decoupling references between the stores is worth the cost of the additional lookup, especially as Riak changes will reduce the relative volume of value fetches (by comparison to metadata/HEAD fetches).
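The decoupling can be shown with a toy model (all names and structures here are hypothetical; the real store is Erlang, and walks a tree plus two hashtable stages rather than using dicts): the Ledger maps a Key to an SQN, the Journal maps the SQN to a location, and Journal compaction can relocate a value without any Ledger update:

```python
# Toy model of the indirection described above (hypothetical names).
ledger = {"bucket/key1": 17}                 # Key -> SQN; no file pointer is stored
journal = {17: ("journal_file_2", 4096)}     # SQN -> (file, offset)

def fetch(key):
    sqn = ledger[key]     # stage 1: Ledger lookup yields the sequence number
    return journal[sqn]   # stage 2: Journal resolves the SQN to a location

before = fetch("bucket/key1")

# Journal compaction moves the value; the Ledger entry is untouched, so no
# Ledger rewrite (and no locking against a racing key update) is required.
journal[17] = ("journal_file_9", 512)

assert ledger["bucket/key1"] == 17   # Ledger unchanged by compaction
print(before, "->", fetch("bucket/key1"))
```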
## Why make this backwards compatible with OTP16?
Yes why, why do I have to do this?
## Why name things this way?
The names used in the Actor model are loosely correlated with names used for on-course bookmakers (e.g. Bookie, Clerk, Penciller).

There is no strong reason for drawing this specific link, other than the generic sense that this group represents a tight-knit group of workers passing messages from a front-man (the bookie) to keep a local view of state (a book) for a queue of clients; normally these groups work in a loosely-coupled orchestration with a number of other bookmakers to form a betting market that converges towards an eventually consistent price.

There were some broad parallels between bookmakers in a market and vnodes in a Riak database, and using the actor names just stuck, even though the analogy is imperfect. Somehow having a visual model of these mysterious guys in top hats working away, helped me imagine the store in action.
The name LevelEd owes some part to the influence of my eldest son Ed, and some part to the Yorkshire phrasing of the word Head.
|
BIN
docs/pics/ascot_bookies.jpg
Normal file
After Width: | Height: | Size: 114 KiB |
BIN
docs/pics/betting_market.jpg
Normal file
After Width: | Height: | Size: 15 KiB |
BIN
test/volume/cluster_one/output/summary_leveldb_5n_45t_fixed.png
Normal file
After Width: | Height: | Size: 108 KiB |
BIN
test/volume/cluster_one/output/summary_leveldb_5n_45t_ii.png
Normal file
After Width: | Height: | Size: 109 KiB |
BIN
test/volume/cluster_one/output/summary_leveled_5n_45t_fixed.png
Normal file
After Width: | Height: | Size: 70 KiB |
BIN
test/volume/cluster_one/output/summary_leveled_5n_45t_ii.png
Normal file
After Width: | Height: | Size: 73 KiB |
After Width: | Height: | Size: 114 KiB |
After Width: | Height: | Size: 100 KiB |
After Width: | Height: | Size: 113 KiB |
After Width: | Height: | Size: 99 KiB |
After Width: | Height: | Size: 79 KiB |
After Width: | Height: | Size: 81 KiB |
After Width: | Height: | Size: 76 KiB |
After Width: | Height: | Size: 78 KiB |
BIN
test/volume/cluster_two/output/summary_nosync_d2_leveldb.png
Normal file
After Width: | Height: | Size: 107 KiB |
After Width: | Height: | Size: 99 KiB |
BIN
test/volume/cluster_two/output/summary_nosync_d2_leveldb_ii.png
Normal file
After Width: | Height: | Size: 88 KiB |
BIN
test/volume/cluster_two/output/summary_nosync_d2_leveled.png
Normal file
After Width: | Height: | Size: 94 KiB |
After Width: | Height: | Size: 77 KiB |
BIN
test/volume/cluster_two/output/summary_nosync_d2_leveled_ii.png
Normal file
After Width: | Height: | Size: 76 KiB |