Merge remote-tracking branch 'refs/remotes/origin/master' into mas-tidyup-r01
# Leveled - An Erlang Key-Value store

## Introduction

Leveled is a work-in-progress prototype of a simple Key-Value store based on the concept of Log-Structured Merge Trees, with the following characteristics:

- Optimised for workloads with larger values (e.g. > 4KB).

- Explicitly supports HEAD requests in addition to GET requests.

- Splits the storage of values between key/metadata and body:
  - allowing for HEAD requests which have lower overheads than GET requests, and
  - queries which traverse keys/metadatas to be supported with fewer side effects on the page cache.

- Support for tagging of object types and the implementation of alternative store behaviour based on type.
  - Potentially usable for objects with special retention or merge properties.

- Support for low-cost clones without locking to provide for scanning queries.
  - Low cost specifically where there is a need to scan across keys and metadata (not values).

- Written in Erlang as a message passing system between Actors.
The store has been developed with a focus on being a potential backend to a Riak KV database, rather than as a generic store.
The primary aim of developing (yet another) Riak backend is to examine the potential to reduce the broader costs of providing sustained throughput in Riak, i.e. to provide equivalent throughput on cheaper hardware. It is also anticipated that having a fully-featured pure Erlang backend may assist in evolving new features through the Riak ecosystem which require end-to-end changes, rather than requiring context switching between C++ and Erlang-based components.
The store is not expected to offer lower median latency than the Basho-enhanced leveldb, but it is likely in some cases to offer improvements in throughput, reduced tail latency and reduced volatility in performance. It is expected that the likelihood of finding improvement will correlate with the average object size, and inversely correlate with the availability of Disk IOPS in the hardware configuration.

## More Details

For more details on the store:
- An [introduction](docs/INTRO.md) to Leveled covers some context to the factors motivating design trade-offs in the store.
- The [design overview](docs/DESIGN.md) explains the actor model used and the basic flow of requests through the store.

## Is this interesting?

Making a positive contribution to this space is hard - given the superior brainpower and experience of those that have contributed to the KV store problem space in general, and the Riak backend space in particular.
The target at inception was to do something interesting, to re-think certain key assumptions and trade-offs, and prove through working software the potential for improvements to be realised.
[Initial volume tests](docs/VOLUME.md) indicate that it is at least interesting, with improvements in throughput for multiple configurations - improvements which become more marked as the test progresses (and the base data volume becomes more realistic).
The delta in the table below compares Riak performance in identical test runs using the Leveled backend against the leveldb backend.
Test Description                  | Hardware     | Duration | Avg TPS   | Delta (Overall)  | Delta (Last Hour)
:---------------------------------|:-------------|:--------:|----------:|-----------------:|-------------------:
8KB value, 60 workers, sync       | 5 x i2.2x    | 4 hr     | 12,679.91 | <b>+ 70.81%</b>  | <b>+ 63.99%</b>
8KB value, 100 workers, no_sync   | 5 x i2.2x    | 6 hr     | 14,100.19 | <b>+ 16.15%</b>  | <b>+ 35.92%</b>
8KB value, 50 workers, no_sync    | 5 x d2.2x    | 4 hr     | 10,400.29 | <b>+ 8.37%</b>   | <b>+ 23.51%</b>
Tests generally show a 5:1 improvement in tail latency for Leveled.

All tests have in common:

- Target Key volume - 200M with pareto distribution of load
- 5 GETs per 1 update
- RAID 10 (software) drives
- allow_mult=false, lww=false
- A modified version of Riak optimised for Leveled was used in the Leveled tests
The throughput in leveled is generally CPU-bound, whereas in comparative tests for leveldb the throughput was disk-bound. This potentially makes capacity planning simpler, and opens up the possibility of scaling out to equivalent throughput at much lower cost (as CPU is relatively low cost when compared to disk space at high I/O) - [offering better alignment between resource constraints and the cost of resource](docs/INTRO.md).
More information can be found in the [volume testing section](docs/VOLUME.md).
As a general rule though, the most interesting thing is the potential to enable [new features](docs/FUTURE.md). The tagging of different object types, with an ability to set different rules for both compaction and metadata creation by tag, is a potential enabler for further change. Further, having a separate key/metadata store which can be scanned without breaking the page cache or working against mitigations for write amplification, is also potentially an enabler to offer features to both the developer and the operator.

## Next Steps

Further volume test scenarios are the immediate priority, in particular volume test scenarios with:
- Alternative object sizes;
- Significant use of secondary indexes;
- Use of newly available [EC2 hardware](https://aws.amazon.com/about-aws/whats-new/2017/02/now-available-amazon-ec2-i3-instances-next-generation-storage-optimized-high-i-o-instances/) which potentially represents a significant change to assumptions about hardware efficiency and cost.
- The creation of riak_test tests for new Riak features enabled by Leveled.

## Feedback

Please create an issue if you have any suggestions. You can ping me @masleeds if you wish.

## Running Leveled

Unit and current tests in leveled should run with rebar3. Leveled has been tested on OTP 18, but it can be started with OTP 16 to support Riak (although tests will not work as expected). A new database can be started by running:

```erlang
{ok, Bookie} = leveled_bookie:book_start(RootPath, LedgerCacheSize, JournalSize, SyncStrategy)
```

This will start a new Bookie. It will start and look for existing data files, under the RootPath, and start empty if none exist. A LedgerCacheSize of 2000, a JournalSize of 500000000 (500MB) and a SyncStrategy of recovr should work OK.
The book_start method should respond once startup is complete. The leveled_bookie module includes the full API for external use of the store.
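As a sketch of basic usage, a store could be started and exercised as below. This is an illustration only: the `book_put/5`, `book_get/3` and `book_close/1` calls are assumptions based on the API description above, and the exact arities may differ between versions.

```erlang
%% Minimal usage sketch - the arities of book_put/book_get/book_close are
%% assumptions and may differ between leveled versions.
-module(bookie_example).
-export([run/0]).

run() ->
    RootPath = "/tmp/leveled_example",
    %% Ledger cache of 2000, 500MB journal files, recovr sync strategy
    {ok, Bookie} = leveled_bookie:book_start(RootPath, 2000, 500000000, recovr),
    %% PUT with an empty list of index specs, then read the value back
    ok = leveled_bookie:book_put(Bookie, <<"Bucket">>, <<"Key">>, <<"Value">>, []),
    {ok, <<"Value">>} = leveled_bookie:book_get(Bookie, <<"Bucket">>, <<"Key">>),
    ok = leveled_bookie:book_close(Bookie).
```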
Read through the [end_to_end test suites](test/end_to_end/) for further guidance.

Both the Inker and the Penciller must undertake compaction work.

Both the Penciller and the Inker each make use of their own dedicated clerk for completing this work. The Clerk will add all new files necessary to represent the new view of that part of the store, and then update the Inker/Penciller with the new manifest that represents that view. Once the update has been acknowledged, any removed files can be marked as delete_pending, and they will poll the Inker (if a Journal file) or Penciller (if a Ledger file) for it to confirm that no clones of the system still depend on the old snapshot of the store to be maintained.
The Penciller's Clerk's work to compact the Ledger is effectively continuous: it will regularly poll the Penciller for compaction work, and do work whenever work is necessary. The Inker's Clerk's compaction responsibility is expected to be scheduled. The Inker's Clerk will need to determine if compaction work is necessary when scheduled, and it does this by asking each CDB file for a random sample of keys and object sizes; the file is then scored by the Clerk for compaction, based on the indication from the sample of the likely proportion of space that will be recovered. The Inker's Clerk will compact only the most rewarding 'run' of journal files, if and only if the most rewarding run surpasses a threshold score.
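The scoring step can be pictured as follows. This is a hypothetical illustration with invented names, not the actual leveled implementation: each sampled key/size pair is assumed to have been checked against the Ledger and marked stale if a newer sequence number exists for that key, and the file's score is the proportion of sampled bytes that could be reclaimed.

```erlang
%% Hypothetical sketch of sample-based compaction scoring (invented names).
%% Sample :: [{Key, Size, current | stale}] - a random sample of entries
%% from one CDB file, each marked stale if the Ledger now holds a newer
%% sequence number for that Key.
score_file(Sample) ->
    StaleBytes = lists:sum([Size || {_K, Size, stale} <- Sample]),
    TotalBytes = lists:sum([Size || {_K, Size, _St} <- Sample]),
    case TotalBytes of
        0 -> 0.0;
        _ -> StaleBytes / TotalBytes  %% likely proportion of recoverable space
    end.
```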
### File Clerks
Every file within the store has its own dedicated process (modelled as a finite state machine). Files are never created or accessed by the Inker or the Penciller; interactions with the files are managed through messages sent to the File Clerk processes which own the files.
The File Clerks themselves are ignorant of their context within the store. For example, a file in the Ledger does not know what level of the Tree it resides in. The state of the store is represented by the Manifest, which maintains a picture of the store and contains the process IDs of the File Clerks which represent the files.

File clerks spend a short initial portion of their life in a writable state.

## Clones
Both the Penciller and the Inker can be cloned, to provide a snapshot of the database at a point in time. A long-running process may then use this clone to query the database concurrently with other database actions. Clones are used for Journal compaction, but also for scanning queries in the Penciller (for example to support 2i queries or hashtree rebuilds in Riak).
The snapshot process is simple. The necessary loop-state is requested from the real worker, in particular the manifest and any immutable in-memory caches, and a new gen_server worker is started with the loop state. The clone registers itself as a snapshot with the real worker, with a timeout that will allow the snapshot to expire if the clone silently terminates. The clone will then perform its work, making requests to the file workers referred to in the manifest. Once the work is complete the clone should remove itself from the snapshot register in the real worker before closing.
The snapshot register is used by the real worker when file workers are placed in the delete_pending state. File processes enter this state when they have been replaced in the current manifest, but access may still be needed by a cloned process using an older version of the manifest. The file process should poll the Penciller/Inker in this state to check if deletion can be completed, and the Penciller/Inker should check the register of snapshots to confirm that no active snapshot has a potential dependency on the file before responding to proceed with deletion.
Managing the system in this way requires that ETS tables are used sparingly for holding in-memory state, and immutable and hence shareable objects are used instead. The exception to the rule is the small in-memory state of recent additions kept by the Bookie - which must be exported to a list on every snapshot request.
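The register check described above can be sketched as follows - a hypothetical illustration with invented names and record shapes, not the actual leveled code:

```erlang
%% Hypothetical sketch (invented names): may a delete_pending file process
%% proceed with deletion? A registered snapshot blocks deletion if it has
%% not yet expired and was taken against a manifest old enough to still
%% reference the file.
can_delete(FileManifestSQN, SnapshotRegister, Now) ->
    Blocking = [Pid || {Pid, SnapManifestSQN, ExpiryTime} <- SnapshotRegister,
                       ExpiryTime > Now,
                       SnapManifestSQN =< FileManifestSQN],
    Blocking =:= [].
```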
## Types
Objects in Leveled all have a type (or tag). The tag determines how the object will be compacted in both the Ledger and Journal, and what conversions are applied when fetching the object or when splitting the object metadata to convert from the Inker to the Journal.

Three types are initially supported:

- `o` - standard object
- `o_rkv` - riak object
- `i` - index term
All Ledger Keys created for any type must be 4-tuples starting with the tag. Abstraction with regards to types is currently imperfect, but the expectation is that these types will make support for application-specific behaviours easier to achieve, such as behaviours which may be required to support different [CRDTs](https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type).
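For illustration, Ledger Keys of each tag might look as follows. Only the leading tag position in the 4-tuple is confirmed above; the order and meaning of the remaining elements are assumptions:

```erlang
%% Illustrative 4-tuple Ledger Keys - only the leading tag position is
%% confirmed; the remaining element layout is an assumption.
example_ledger_keys() ->
    [{o, <<"Bucket">>, <<"Key">>, null},                          %% standard object
     {o_rkv, <<"Bucket">>, <<"Key">>, null},                      %% riak object
     {i, <<"Bucket">>, {<<"idx1_bin">>, <<"Term">>}, <<"Key">>}]. %% index term
```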

## GET/PUT Paths

The PUT path for new objects and object changes depends on the Bookie interacting with the Inker to ensure that the change has been persisted in the Journal; the Ledger is updated in batches after the PUT has been completed.

The GET path follows the HEAD path.

All other queries (folds over indexes, keys and objects) are managed by cloning either the Penciller, or the Penciller and the Inker.

## Startup

On startup the Bookie will start, and first prompt the Inker to start up. The Inker will load the manifest and attempt to start a process for each file in the manifest, and then update the manifest with those process IDs. The Inker will also look on disk for any CDB files more recent than the manifest, and will open those at the head of the manifest. Once Inker startup is complete, the Bookie will prompt the Penciller to start up.
The Penciller will, as with the Inker, open the manifest and then start a process for each file in the manifest to open those files. The Penciller will also look for a persisted level-zero file which is more recent than the current manifest, and open a process for this too. As part of this process the Penciller determines the highest sequence number that has been seen in the persisted Ledger. Once the highest sequence number has been determined, the Penciller will request a reload from that sequence number.
To complete the reload request, the Inker will scan all the Keys/Values in the Journal from that sequence number (in batches), converting the Keys and Values into the necessary Ledger changes, and loading those changes into the Ledger via the Penciller.
Once the Ledger is up-to-date with the very latest changes in the Journal, the Bookie signals that startup is complete and it is ready to receive load.
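The reload step above can be pictured as a fold over the Journal from the persisted sequence number. This is a hypothetical sketch in which every name is invented for illustration - it is not the actual leveled implementation:

```erlang
%% Hypothetical sketch of the startup reload (all names invented). The
%% Journal is scanned from the Ledger's highest persisted sequence number,
%% with entries converted to Ledger changes and pushed in batches.
reload_ledger(Inker, Penciller, FromSQN) ->
    LoadFun =
        fun({SQN, Key, Value}, BatchAcc) ->
            Changes = convert_to_ledger_changes(SQN, Key, Value),
            %% push to the Penciller once the batch is large enough,
            %% returning the (possibly emptied) accumulator
            maybe_push_batch(Penciller, [Changes | BatchAcc])
        end,
    journal_fold(Inker, FromSQN, LoadFun, []).
```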

## Recovery

Theoretically, the entire Ledger can be restored from the Journal, if the Ledger were erased and then the Bookie was started as normal. This was intended as a potential simplified backup solution, as this would require only the slowly mutating Journal to be backed up, and so would be suited to an incremental delta-based backup strategy and restoring to points in time. There are some potential issues with this:
- It is not clear how long scanning over the Journal would take with an empty ledger, and it would certainly be too long for the standard Riak vnode start timeout.
- Some changes build over history, and so would be lost if earlier changes are lost due to compaction of out-of-date values (e.g. 2i index changes in Riak).
- Journal compaction becomes more complicated if it is necessary to recover in a scenario where the head of the Journal has been corrupted (so if we have B.0, B.1 and B.2 as three versions of the object, if B.2 is lost due to corruption and B.1 is compacted then B.0 will be resurrected on recovery - so post compaction there is no point in time recovery possible).
Three potential recovery strategies are supported to provide some flexibility for dealing with this:
- recovr - if a value is out of date at the time of compaction, remove it. This strategy assumes anomalies can be recovered by a global anti-entropy strategy, or that the persisted part of the Ledger is never lost (the Inker's Clerk will never compact a value received after the highest persisted sequence number in the Ledger).
- retain - on compaction KeyDeltas are retained in the Journal, only values are removed.
- recalc (not yet implemented) - the compaction rules assume that on recovery the key changes will be recalculated by comparing the change with the current database state.

The store supports all the required Riak backend capabilities.

- Support for "Tags" in keys to allow for different treatment of different key types (i.e. changes in merge actions, metadata retention, caching for lookups).
- Support for Sub-Keys in objects. There may be the potential to use this along with key types to support alternative approaches to CRDT sets or local rather than global secondary indexes.
- A fast "list bucket" capability, borrowing from the HanoiDB implementation, to allow for safe bucket listing in production.
- A bucket size query, which requires traversal only of the Ledger and counts the keys and sums the total on-disk size of the objects within the bucket.
- A new SW count to operate in parallel to DW, where SW is the number of nodes required to have flushed to disk (not just written to buffer); with SW being configurable to 0 (no vnodes will need to sync this write), 1 (the PUT coordinator will sync, but no other vnodes) or all (all vnodes will be required to sync this write before providing a DW ack). This will allow the expensive disk sync operation to be constrained to the writes for which it is most needed.
- Support for a specific Riak tombstone tag where reaping of tombstones can be deferred (by many days) i.e. so that a 'sort-of-keep' deletion strategy can be followed that will eventually garbage collect without the need to hold pending full deletion state in memory.
- The potential to support Map/Reduce functions on object metadata not values, so that cache-pollution and disk i/o overheads of full object Map/Reduce can be eliminated by smarter object construction strategies that promote information designed for queries from values into metadata.

## Outstanding work

There is some work required before Leveled could be considered production ready:

- Improved handling of corrupted files.
- A way of identifying the partition in each log to ease the difficulty of tracing activity when multiple stores are run in parallel.

## Riak Features Implemented

The following Riak features have been implemented:

### Leveled Backend

Branch: [mas-leveleddb](https://github.com/martinsumner/riak_kv/tree/mas-leveleddb)
Branched-From: [Basho/develop](https://github.com/basho/riak_kv)

Description:

The leveled backend has been implemented with some basic manual functional tests. The backend has the following capabilities:
- async_fold - can return a Folder() function in response to 2i queries;
- indexes - support for secondary indexes (currently being returned in the wrong order);
- head - supports a HEAD as well as a GET request;
- hash_query - can return keys/hashes when rebuilding hashtrees, so the hashtree implementation can call this rather than folding objects and hashing the objects returned from the fold.
All standard features should (in theory) be supportable (e.g. TTL, M/R, term_regex etc) but aren't subject to end-to-end testing at present.
There is a uses_r_object capability that leveled should in theory be able to support (making it unnecessary to convert to/from binary form). However, given that only the yessir backend currently supports this, the feature hasn't been implemented in case there are potential issues lurking within Riak when using this capability.

### Fast List-Buckets

Normally Basho advise not to list buckets in production - as on other backends it is a hugely expensive operation. Accidentally running list buckets can lead to significant resource pressure, and also rapid consumption of available disk space.
The Leveled backend borrows from the hanoidb implementation of list buckets, which supports listing buckets with a simple low-cost async fold. So there should be no issue listing buckets in production with these backends.
Note - the technique would work in leveldb and memory backends as well (and perhaps bitcask), it just isn't implemented there. Listing buckets should not really be an issue.

### GET_FSM -> Using HEAD

Branch: [mas-leveled-getfsm](https://github.com/martinsumner/riak_kv/tree/mas-leveled-getfsm)
Branched-From: [mas-leveleddb](https://github.com/martinsumner/riak_kv/tree/mas-leveleddb)

Description:

In standard Riak the Riak node that receives a GET request starts a riak_kv_get_fsm to handle that request. This FSM goes through the following primary states:
- prepare (get preflists etc)
- validate (make sure there is sufficient vnode availability to handle the request)
- execute (send the GET request to n vnodes)
- waiting_vnode_r (loop waiting for the response from r vnodes; every time a response is received, check if enough vnodes have responded, and then either finalise a response or wait for more)
- waiting_read_repair (read repair if a responding vnode is out of date)

The alternative FSM in this branch makes the following changes:

- sends HEAD not GET requests at the execute stage
- when 'enough' is received in waiting_vnode_r, elects a vnode or vnodes from which to fetch the body
- re-calls execute, asking for just the necessary GET requests
- on the second pass through waiting_vnode_r, updates HEAD responses to GET responses, and recalculates the winner (including body) once all vnodes which were sent a GET request have responded

So rather than doing three Key/Metadata/Body backend lookups for every request, normally (assuming no siblings) Riak now requires three Key/Metadata and one Key/Metadata/Body lookup. For larger bodies, this then avoids the additional disk activity and network load associated with fetching and transmitting the body three times for every request. There are some other side effects:
- It is more likely to detect out-of-date objects in slow nodes (as the n-r responses may still be received and added to the result set when waiting for the GET response in the second loop).
- The database is in-theory much less likely to have [TCP Incast](http://www.snookles.com/slf-blog/2012/01/05/tcp-incast-what-is-it/) issues, hence reducing the cost of network infrastructure, and reducing the risk of running Riak in shared infrastructure environments.
The feature will not at present work safely with legacy vclocks. This branch generally relies on vector clock comparison only for equality checking, and removes some of the relatively expensive whole-body equality tests (either as a result of sets:from_list/1 or riak_object:equal/2), which are believed to be a legacy of issues with pre-dvv clocks.
In tests, the benefit of this may not be that significant - as the primary resource saved is disk/network, and so if these are not the resources under pressure, the gain may not be significant. In tests bound by CPU not disks, only a 10% improvement has so far been measured with this feature.

### PUT -> Using HEAD

Branch: [mas-leveled-putfsm](https://github.com/martinsumner/riak_kv/tree/mas-leveled-putfsm)
Branched-From: [mas-leveled-getfsm](https://github.com/martinsumner/riak_kv/tree/mas-leveled-getfsm)

Description:

The standard PUT process for Riak requires the PUT to be forwarded to a coordinating vnode first. The coordinating PUT process requires the object to be fetched from the local vnode only, and a new updated object created. The remaining n-1 vnodes are then sent the updated object as a PUT, and once w/dw/pw nodes have responded the PUT is acknowledged.
The other n-1 vnodes must also do a local GET before the vnode PUT (so as not to erase a more up-to-date value that may not have been present at the coordinating vnode).
This branch changes the behaviour slightly at the non-coordinating vnodes. These vnodes will now try a HEAD request before the local PUT (not a GET request), and if the HEAD request contains a vclock which is <strong>dominated</strong> by the updated PUT, it will not attempt to fetch the whole object for the syntactic merge.
This should save two object fetches (where n=3) in most circumstances.

The following section is a brief overview of some of the motivating factors behind the design of the store.

## A Paper To Love
The concept of a Log Structured Merge Tree is described within the 1996 paper ["The Log Structured Merge Tree"](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.44.2782&rep=rep1&type=pdf) by Patrick O'Neil et al. The paper is not specific on precisely how an LSM-Tree should be implemented, proposing a series of potential options. The paper's focus is on framing the justification for design decisions in the context of hardware economics.
The hardware economics at the time of the paper were:
The purchase costs of disks, though, do not accurately reflect the running costs of disks - because disks fail, and fail often. Also, the hardware choices made by the NHS for the Spine programme do not necessarily reflect generic industry choices and costs.
To get an up-to-date and objective measure of what the overall costs are, the Amazon price list can assist, if we assume their data centres are managed with a high degree of efficiency due to their scale. Assumptions on the costs of individual components can be made by examining differences between specific instance prices.
As at 26th January 2017, the [on-demand instance pricing](https://aws.amazon.com/ec2/pricing/on-demand/) for servers is:
- c4.4xlarge - 16 CPU, 30 GB RAM, EBS only - $0.796
- r4.xlarge - 4 CPU, 30 GB RAM, EBS only - $0.266
- r4.4xlarge - 16 CPU, 122 GB RAM, EBS only - $1.064
The availability of SSDs is not a silver bullet to disk i/o problems when cost is considered: although they eliminate the additional costs of random page access through the removal of the disk head movement overhead (of about 6.5ms per shift), this benefit comes at an order of magnitude difference in cost compared to spinning disks, and at a cost greater than half the price of DRAM. SSDs have not taken the problem of managing the overheads of disk persistence away; they've simply added another dimension to the economic profiling problem.
|
||||
|
||||
In physical on-premise server environments there is also commonly the cost of disk controllers. Disk controllers bend the economics of persistence through the presence of flash-backed write caches. However, disk controllers also fail - within the NHS environment disk controller failures are the second most common device failure after individual disks. Failures of disk controllers are also expensive to resolve, not being hot-pluggable like disks, and carrying greater risk of node data-loss due to either bad luck or bad process during the change. It is noticeable that EC2 does not have disk controllers and given their failure rate and cost of recovery, this appears to be a sensible trade-off.
|
||||
In physical on-premise server environments there is also commonly the cost of disk controllers. Disk controllers bend the economics of persistence through the presence of flash-backed write caches. However, disk controllers also fail - within the NHS environment disk controller failures are the second most common device failure after individual disks. Failures of disk controllers are also expensive to resolve, not being hot-pluggable like disks, and carrying greater risk of node data-loss due to either bad luck or bad process during the change.
|
||||
|
||||
It is noticeable that EC2 does not have disk controllers and given their failure rate and cost of recovery, this appears to be a sensible trade-off. However, software-only RAID has drawbacks, include the time to setup RAID (24 hours on a d2.2xlarge node), recover from a disk failure and the time to run [scheduled checks](https://www.thomas-krenn.com/en/wiki/Mdadm_checkarray).
|
||||
|
||||
Making cost-driven decisions about storage design remains as relevant now as it was two decades ago when the LSM-Tree paper was published, especially as we can now directly see those costs reflected in hourly resource charges.
|
||||
|
||||
|
@@ -64,9 +66,9 @@ The evolution of leveledb in Riak, from the original Google-provided store to th

The original leveldb considered in part the hardware economics of the phone, where there are clear constraints around CPU usage - due to both form-factor and battery life - and where disk space may be at a greater premium than disk IOPS. Some of the evolution of eleveldb is down to the Riak-specific problem of needing to run multiple stores on a single server, where even load distribution may lead to a synchronisation of activity. Much of the evolution is also about how to make better use of the continuous availability of CPU resource in the face of the relative scarcity of disk resource. Changes such as overlapping files at level 1, hot threads, compression improvements etc. all move eleveldb in the direction of being easier on disk at the cost of CPU, and the hardware economics of servers would indicate this is a wise choice.

### Planning for Leveled

The primary design differentiation between Leveled and LevelDB is the separation of the key store (known as the Ledger in Leveled) and the value store (known as the Journal). The Journal is like a continuous extension of the nursery log within LevelDB, only with a gradual evolution into [CDB files](https://en.wikipedia.org/wiki/Cdb_(software)) so that file offset pointers are not required to exist permanently in memory. The Ledger is a merge tree structure, with values substituted with metadata and a sequence number - where the sequence number can be used to find the value in the Journal.

This is not an original idea; the LSM-Tree paper specifically talked about the trade-offs of placing identifiers rather than values in the merge tree:
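The Ledger/Journal split can be illustrated with a minimal sketch (Python rather than Erlang, and two dicts rather than the real merge tree and CDB files - purely illustrative): a HEAD request is answered from the key/metadata store alone, while a GET follows the sequence number into the value store.

```python
# Toy sketch of the Ledger/Journal split - not the actual implementation.
class ToyStore:
    def __init__(self):
        self.seq = 0
        self.journal = {}   # sequence number -> full value (the Journal)
        self.ledger = {}    # key -> (sequence number, metadata) (the Ledger)

    def put(self, key, value, metadata):
        self.seq += 1                    # sequence numbers only ever grow
        self.journal[self.seq] = value   # value written once, to the Journal
        self.ledger[key] = (self.seq, metadata)

    def head(self, key):
        # HEAD: answered from the Ledger alone - no value store access
        _, metadata = self.ledger[key]
        return metadata

    def get(self, key):
        # GET: Ledger lookup, then follow the sequence number into the Journal
        seqno, _ = self.ledger[key]
        return self.journal[seqno]

store = ToyStore()
store.put("bucket/key1", b"a large object body", {"vclock": 1, "size": 19})
print(store.head("bucket/key1"))   # metadata only - cheap
print(store.get("bucket/key1"))    # full value, via the Journal
```

This is why HEAD requests carry lower overheads than GETs, and why key/metadata scans can run without pulling values through the page cache.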
@@ -82,7 +84,7 @@ So the hypothesis that separating Keys and Values may be optimal for LSM-Trees i

## Being Operator Friendly

The LSM-Tree paper focuses on hardware trade-offs in database design. Leveled is focused on the job of being a backend to a Riak database, and the Riak database is opinionated on the trade-off between developer and operator productivity. Running a Riak database imposes constraints and demands on developers - there are things the developer needs to think hard about: living without transactions, considering the resolution of siblings, manual modelling for query optimisation.

However, in return for this pain there is great reward, a reward which is gifted to the operators of the service. Riak clusters are reliable and predictable, and operational processes are slim and straightforward - preparation for managing a Riak cluster in production needn't go much beyond rehearsing the magical cure-alls of the node stop/start and node join/leave processes. At the NHS, where we have more than 50 Riak nodes in 24-by-365 business-critical operations, it is not untypical to go more than 28 days without anyone logging on to a database node. This is a relief for those of us who have previously lived in a world of databases with endless configuration parameters to test or blame for issues, where you always seem to be the unlucky one who suffers the outages "never seen in any other customer", where the databases come with ever more complicated infrastructure dependencies, and where DBAs need to be constantly at hand to analyse reports, kill rogue queries and re-run the query optimiser as an answer to the latest 'blip' in performance.
127
docs/VOLUME.md

@@ -4,87 +4,124 @@

Initial volume tests have been [based on standard basho_bench eleveldb test](../test/volume/single_node/examples) to run multiple stores in parallel on the same node and subjecting them to concurrent pressure.

This showed a [relative positive performance for leveled](VOLUME_PRERIAK.md) for both population and load. This also showed that although the leveled throughput was relatively stable, it was still subject to fluctuations related to CPU constraints - especially as compaction of the ledger was a CPU intensive activity. Prior to moving on to full Riak testing, a number of changes were then made to leveled to reduce the CPU load during these merge events.
## Initial Riak Cluster Tests

Testing in a Riak cluster has been based on like-for-like comparisons between leveldb and leveled - except that leveled was run within a Riak [modified](FUTURE.md) to use HEAD not GET requests when HEAD requests are sufficient.

The initial testing was based on simple gets and updates, with 5 gets for every update.

### Basic Configuration - Initial Tests

The configuration consistent across all tests is:

- A 5 node cluster

- Using i2.2xlarge EC2 or d2.2xlarge nodes with mirrored/RAID10 drives (for data partition only)

- deadline scheduler, transparent huge pages disabled, ext4 partition

- A 64 partition ring-size

- AAE set to passive

- A pareto distribution of requests with a keyspace of 200M keys

- 5 GETs for each UPDATE
### Mid-Size Object, SSDs, Sync-On-Write

This test has the following specific characteristics

- An 8KB value size (based on crypto:rand_bytes/1 - so cannot be effectively compressed)

- 60 concurrent basho_bench workers running at 'max'

- i2.2xlarge instances

- allow_mult=false, lww=false

Comparison charts for this test:

Riak + leveled | Riak + eleveldb
:-------------------------:|:-------------------------:
 | 
### Mid-Size Object, SSDs, No Sync-On-Write

This test has the following specific characteristics

- An 8KB value size (based on crypto:rand_bytes/1 - so cannot be effectively compressed)

- 100 concurrent basho_bench workers running at 'max'

- i2.2xlarge instances

- allow_mult=false, lww=false

Comparison charts for this test:

Riak + leveled | Riak + eleveldb
:-------------------------:|:-------------------------:
 | 

Tail latency under load in leveled is less than 5% of the comparable value in eleveldb (note there is a significant difference in the y-axis scale between the latency charts on these graphs).
### Mid-Size Object, HDDs, No Sync-On-Write

This test has the following specific characteristics

- An 8KB value size (based on crypto:rand_bytes/1 - so cannot be effectively compressed)

- 50 concurrent basho_bench workers running at 'max'

- d2.2xlarge instances

- allow_mult=false, lww=false

Comparison charts for this test:

Riak + leveled | Riak + eleveldb
:-------------------------:|:-------------------------:
 | 

Note that there is a clear inflexion point when throughput starts to drop sharply at about the hour mark into the test. This is the stage when the volume of data has begun to exceed the volume supportable in cache, and so disk activity begins to be required for GET operations with increasing frequency.
### Half-Size Object, SSDs, No Sync-On-Write

to be completed

### Double-Size Object, SSDs, No Sync-On-Write

to be completed

### Lies, damned lies etc
The first thing to note about the test is the impact of the pareto distribution, and the start from an empty store, on what is actually being tested. At the start of the test there is a 0% chance of a GET request actually finding an object. Normally, it will be 3 hours into the test before a GET request will have a 50% chance of finding an object.



Both leveled and leveldb are optimised for finding non-presence through the use of bloom filters, so the comparison is not unduly influenced by this. However, the workload at the end of the test is both more realistic (in that objects are found), and harder if the previous throughput had been greater (in that more objects are found).

So it is better to focus on the results at the tail of the tests, as at the tail the results are a more genuine reflection of behaviour against the advertised test parameters.
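The bloom-filter point above - cheaply answering "definitely not present" without touching disk - can be shown with a minimal sketch (illustrative Python; the sizes, hash scheme and class are hypothetical, not the leveled or leveldb implementation):

```python
import hashlib

# Minimal bloom filter sketch - can reject lookups for never-written keys
# without any disk access, at the cost of rare false positives.
class Bloom:
    def __init__(self, bits=8192, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = 0  # a big integer used as a bit array

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, key):
        for pos in self._positions(key):
            self.array |= 1 << pos

    def maybe_present(self, key):
        # False => definitely absent; True => present, or a false positive
        return all(self.array >> pos & 1 for pos in self._positions(key))

bloom = Bloom()
for k in range(100):
    bloom.add(f"bucket/key{k}")

print(bloom.maybe_present("bucket/key42"))    # True - really present
print(bloom.maybe_present("bucket/missing"))  # almost certainly False
```

A negative answer is always exact (no false negatives), which is why the empty-store phase of the test costs both backends so little.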
Test Description | Hardware | Duration | Avg TPS | Delta (Overall) | Delta (Last Hour)
:---------------------------------|:-------------|:--------:|----------:|-----------------:|-------------------:
8KB value, 60 workers, sync | 5 x i2.2x | 4 hr | 12,679.91 | <b>+ 70.81%</b> | <b>+ 63.99%</b>
8KB value, 100 workers, no_sync | 5 x i2.2x | 6 hr | 14,100.19 | <b>+ 16.15%</b> | <b>+ 35.92%</b>
8KB value, 50 workers, no_sync | 5 x d2.2x | 6 hr | 10,400.29 | <b>+ 8.37%</b> | <b>+ 23.51%</b>

Leveled, like bitcask, will defer compaction work until a designated compaction window, and these tests were run outside of that compaction window. So although the throughput of leveldb is lower, it has no deferred work at the end of the test. Future testing work is scheduled to examine leveled throughput during a compaction window.

As a general rule, looking at the resource utilisation during the tests, the following conclusions can be drawn:
- When unconstrained by disk I/O limits, leveldb can achieve a greater throughput rate than leveled.

- During these tests leveldb is frequently constrained by disk I/O limits, and the frequency with which it is constrained increases the longer the test is run for.

- leveled is almost always constrained by CPU, or by the limits imposed by response latency and the number of concurrent workers.

- Write amplification is the primary delta in disk contention between leveldb and leveled - as leveldb is amplifying the writing of values, not just keys, it is creating a significantly larger 'background noise' of disk activity, and that noise is sufficiently variable that it invokes response time volatility even when r and w values are less than n.

- leveled has substantially lower tail latency, especially on PUTs.

- leveled throughput would be increased by adding concurrent workers, and increasing the available CPU.

- leveldb throughput would be increased by having improved disk i/o.
## Riak Cluster Test - Phase 2

to be completed ..

Testing with changed hashtree logic in Riak so key/clock scan is effective

## Riak Cluster Test - Phase 3

to be completed ..

Testing during a journal compaction window

## Riak Cluster Test - Phase 4

to be completed ..
42
docs/WHY.md

@@ -2,7 +2,7 @@

## Why not just use RocksDB?

Well that wouldn't have been as interesting.

All LSM-trees which evolve off the back of leveldb are trying to improve leveldb in a particular context. I'm considering larger values, with a need for iterators and object time-to-lives, optimisations by supporting HEAD requests, and also the problem of running multiple isolated nodes in parallel on a single server.
@@ -38,12 +38,46 @@ From the start I decided that fadvise would be my friend, in part as:

- It was easier to depend on the operating system to understand and handle LRU management.

- I'm expecting various persisted elements (not just values, but also key ranges) to benefit from compression, so want to maximise the volume of data which can be held in memory (by caching the compressed view in the page cache).

- I expect to reduce the page-cache polluting events that can occur in Riak by reducing object scanning.

Ultimately though, sophisticated memory management is hard, and beyond my capability in the timescale available.

The design may make some caching strategies relatively easy to implement in the future though. Each file process has its own LoopData, and to that LoopData independent caches can be added. This is currently used for caching bloom filters and hash index tables, but could be used in a more flexible way.
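To illustrate what "fadvise would be my friend" means in practice, here is a small sketch of the underlying syscall via Python's `os.posix_fadvise` (leveled itself is Erlang; this is just a demonstration of advising the kernel's page cache, with a hypothetical temporary file):

```python
import os
import tempfile

# Write a small file to demonstrate page-cache hints on its descriptor.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"value-data" * 1024)
    path = f.name

fd = os.open(path, os.O_RDONLY)
try:
    if hasattr(os, "posix_fadvise"):  # Unix-only; no-op guard elsewhere
        # Hint that we will read this file sequentially ...
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        # ... and that, once done, its pages need not stay cached.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    data = os.read(fd, 10)
finally:
    os.close(fd)
    os.unlink(path)

print(data)
```

The hints are advisory only: the OS keeps ownership of LRU decisions, which is exactly the delegation the bullet points above argue for.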

## Why do the Sequence Numbers not wrap?

Each update increments the sequence number, but the sequence number does not wrap around at any stage; it will grow forever. The reasoning behind not supporting some sort of wraparound and vacuuming of the sequence number is:

- There is no optimisation based on fitting into a specific size, so no immediate performance benefit from wrapping;

- Auto-vacuuming is hard to achieve, especially without creating a novel and unexpected operational event that is hard to capacity plan for;

- The standard Riak node transfer feature will reset the sequence number in the replacement vnode - providing a natural way to address any unexpected issue with the growth in this integer.
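Some back-of-envelope arithmetic (my own, not from the source) supports the first point: Erlang integers are arbitrary-precision anyway, but even if the sequence number were bounded at 63 bits, exhaustion would not be a practical concern at an assumed (generous) sustained update rate:

```python
# How long until a 63-bit sequence number wraps, assuming a sustained
# 100,000 updates per second against a single store?
max_seq = 2 ** 63
updates_per_second = 100_000
seconds_per_year = 60 * 60 * 24 * 365

years_to_wrap = max_seq / updates_per_second / seconds_per_year
print(f"~{years_to_wrap:,.0f} years to exhaust the counter")
```

So the motivation for avoiding wraparound is operational simplicity, not a looming limit.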
## Why not store pointers in the Ledger?

The Ledger value contains the sequence number which is necessary to look up the value in the Journal. However, this lookup requires a tree to be walked to find the file, and then a two-stage hashtree lookup to find the pointer within the file. The location of the value is known at the time the Ledger entry is written - so why not include a pointer at this stage? The Journal would then be Bitcask-like, relying on external pointers.

The issue with this mechanism would be the added complexity in journal compaction. When the journal is compacted it would be necessary to make a series of Ledger updates to reflect the changes in pointers - and this would require some sort of locking strategy to prevent a race between an external value update and an update to the pointer.

The [WiscKey store](https://www.usenix.org/system/files/conference/fast16/fast16-papers-lu.pdf) achieves this, without being explicit about whether it blocks updates to avoid the race.

The trade-off chosen is that the operational simplicity of decoupling references between the stores is worth the cost of the additional lookup, especially as Riak changes will reduce the relative volume of value fetches (by comparison to metadata/HEAD fetches).
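The trade-off can be made concrete with a toy contrast (illustrative Python with hypothetical file names; the real Journal walks a tree of sequence-number ranges, not a flat index): because the Ledger stores only a sequence number, journal compaction can move values between files without touching the Ledger at all.

```python
# Toy model: ledger holds stable sequence numbers; a journal-side index maps
# each sequence number to whichever file currently holds the value.
journal_files = {"file1.cdb": {1: b"v1", 2: b"v2"}, "file2.cdb": {3: b"v3"}}
seq_index = {1: "file1.cdb", 2: "file1.cdb", 3: "file2.cdb"}
ledger = {"k1": 1, "k2": 2, "k3": 3}   # key -> sequence number only

def get(key):
    seqno = ledger[key]          # stable reference, never rewritten
    filename = seq_index[seqno]  # the extra lookup: which file holds it now?
    return journal_files[filename][seqno]

# Journal compaction: merge file1 and file2 into file3. Only the journal-side
# index changes - ledger entries are untouched, so no cross-store locking is
# needed and no race with concurrent key updates arises.
journal_files["file3.cdb"] = {**journal_files.pop("file1.cdb"),
                              **journal_files.pop("file2.cdb")}
seq_index = {seqno: "file3.cdb" for seqno in seq_index}

assert get("k2") == b"v2"   # still resolvable after compaction
```

With direct pointers instead of sequence numbers, the compaction step would have had to rewrite every affected ledger entry - the locking problem the text describes.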

## Why make this backwards compatible with OTP16?

Yes why, why do I have to do this?

## Why name things this way?

The names used in the Actor model are loosely correlated with names used for on-course bookmakers (e.g. Bookie, Clerk, Penciller).


There is no strong reason for drawing this specific link, other than the generic sense that this group represents a tight-knit group of workers passing messages from a front-man (the bookie) to keep a local view of state (a book) for a queue of clients, and where normally these groups are working in a loosely-coupled orchestration with a number of other bookmakers to form a betting market that is converging towards an eventually consistent price.



There were some broad parallels between bookmakers in a market and vnodes in a Riak database, and using the actor names just stuck, even though the analogy is imperfect. Somehow having a visual model of these mysterious guys in top hats working away helped me imagine the store in action.

The name LevelEd owes some part to the influence of my eldest son Ed, and some part to the Yorkshire phrasing of the word Head.
BIN docs/pics/ascot_bookies.jpg (new file)
BIN docs/pics/betting_market.jpg (new file)
BIN test/volume/cluster_two/output/NotPresentPerc.png (new file)