README.md

@@ -1,29 +1,35 @@

# Leveled - An Erlang Key-Value store

## Introduction

Leveled is a work-in-progress prototype of a simple Key-Value store based on the concept of Log-Structured Merge Trees, with the following characteristics:

- Optimised for workloads with larger values (e.g. > 4KB).

- Explicitly supports HEAD requests in addition to GET requests.

- Splits the storage of values between key/metadata and body, allowing for:
  - HEAD requests which have lower overheads than GET requests, and
  - queries which traverse keys/metadata to be supported with fewer side effects on the page cache.

- Support for tagging of object types and the implementation of alternative store behaviour based on type.
  - Potentially usable for objects with special retention or merge properties.

- Support for low-cost clones without locking to provide for scanning queries.
  - Low cost specifically where there is a need to scan across keys and metadata (not values).

- Written in Erlang as a message-passing system between Actors.

The store has been developed with a focus on being a potential backend to a Riak KV database, rather than as a generic store.

The primary aim of developing (yet another) Riak backend is to examine the potential to reduce the broader costs of providing sustained throughput in Riak, i.e. to provide equivalent throughput on cheaper hardware. It is also anticipated that having a fully-featured pure Erlang backend may assist in evolving new features through the Riak ecosystem which require end-to-end changes, rather than requiring context switching between C++ and Erlang based components.

The store is not expected to offer lower median latency than the Basho-enhanced leveldb, but in some cases it is likely to offer improvements in throughput, reduced tail latency and reduced volatility in performance. It is expected that the likelihood of finding improvement will correlate with the average object size, and inversely correlate with the availability of disk IOPS in the hardware configuration.

## More Details

For more details on the store:

- An [introduction](docs/INTRO.md) to Leveled covers some context to the factors motivating design trade-offs in the store.

- The [design overview](docs/DESIGN.md) explains the actor model used and the basic flow of requests through the store.

@@ -33,12 +39,63 @@ For more details on the store:

## Is this interesting?

Making a positive contribution to this space is hard, given the superior brainpower and experience of those who have contributed to the KV store problem space in general, and the Riak backend space in particular.

The target at inception was to do something interesting: to re-think certain key assumptions and trade-offs, and to prove through working software the potential for improvements to be realised.

[Initial volume tests](docs/VOLUME.md) indicate that it is at least interesting, showing improvements in throughput for multiple configurations, with the improvement becoming more marked as the test progresses (and the base data volume becomes more realistic).

The delta in the table below is the difference in Riak performance between an identical test run with a Leveled backend and with a leveldb backend.

Test Description                  | Hardware     | Duration | Avg TPS   | Delta (Overall)  | Delta (Last Hour)
:---------------------------------|:-------------|:--------:|----------:|-----------------:|-------------------:
8KB value, 60 workers, sync       | 5 x i2.2x    | 4 hr     | 12,679.91 | <b>+ 70.81%</b>  | <b>+ 63.99%</b>
8KB value, 100 workers, no_sync   | 5 x i2.2x    | 6 hr     | 14,100.19 | <b>+ 16.15%</b>  | <b>+ 35.92%</b>
8KB value, 50 workers, no_sync    | 5 x d2.2x    | 4 hr     | 10,400.29 | <b>+ 8.37%</b>   | <b>+ 23.51%</b>

Tests generally show a 5:1 improvement in tail latency for Leveled.

All tests have in common:

- A target key volume of 200M, with a pareto distribution of load
- 5 GETs per 1 update
- RAID 10 (software) drives
- allow_mult=false, lww=false
- A modified Riak, optimised for Leveled, used in the Leveled tests

The throughput in Leveled is generally CPU-bound, whereas in comparative tests for leveldb the throughput was disk-bound. This potentially makes capacity planning simpler, and opens up the possibility of scaling out to equivalent throughput at much lower cost (as CPU is relatively low cost when compared to disk space at high I/O) - [offering better alignment between resource constraints and the cost of resource](docs/INTRO.md).

More information can be found in the [volume testing section](docs/VOLUME.md).

As a general rule though, the most interesting thing is the potential to enable [new features](docs/FUTURE.md). The tagging of different object types, with an ability to set different rules for both compaction and metadata creation by tag, is a potential enabler for further change. Further, having a separate key/metadata store which can be scanned without breaking the page cache or working against mitigations for write amplification is also potentially an enabler to offer features to both the developer and the operator.

## Next Steps

Further volume test scenarios are the immediate priority, in particular volume tests with:

- Alternative object sizes;

- Significant use of secondary indexes;

- Use of newly available [EC2 hardware](https://aws.amazon.com/about-aws/whats-new/2017/02/now-available-amazon-ec2-i3-instances-next-generation-storage-optimized-high-i-o-instances/) which potentially represents a significant change to assumptions about hardware efficiency and cost.

Beyond volume testing, the next step is to create riak_test tests for new Riak features enabled by Leveled.

## Feedback

Please create an issue if you have any suggestions. You can ping me @masleeds if you wish.

## Running Leveled

Unit and common tests in leveled should run with rebar3. Leveled has been tested on OTP 18, but it can be started with OTP 16 to support Riak (although tests will not work as expected). A new database can be started by running:

```erlang
{ok, Bookie} = leveled_bookie:book_start(RootPath, LedgerCacheSize, JournalSize, SyncStrategy)
```

This will start a new Bookie. It will look for existing data files under the RootPath, and will start empty if none exist. A LedgerCacheSize of 2000, a JournalSize of 500000000 (500MB) and a SyncStrategy of recovr should work OK.

The book_start function should respond once startup is complete. The leveled_bookie module includes the full API for external use of the store.
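
As an illustrative sketch only (not taken from the project docs - the exact exports, arities and return values should be checked against the leveled_bookie module, and the paths and values shown are arbitrary), a basic PUT/GET/HEAD round trip might look like:

```erlang
%% Hypothetical usage sketch - assumes generic book_put/5, book_get/3,
%% book_head/3 and book_close/1 exports in leveled_bookie; confirm the exact
%% arities and return values in the module before relying on this.
{ok, Bookie} = leveled_bookie:book_start("/tmp/leveled_data", 2000, 500000000, recovr),

%% PUT an object (with no secondary index changes), then read it back
ok = leveled_bookie:book_put(Bookie, <<"Bucket1">>, <<"Key1">>, <<"Value1">>, []),
{ok, Value1} = leveled_bookie:book_get(Bookie, <<"Bucket1">>, <<"Key1">>),

%% HEAD returns only the key's metadata from the Ledger, avoiding a Journal read
{ok, Head1} = leveled_bookie:book_head(Bookie, <<"Bucket1">>, <<"Key1">>),

ok = leveled_bookie:book_close(Bookie).
```
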
Read through the [end_to_end test suites](test/end_to_end/) for further guidance.

@@ -12,7 +12,7 @@ The store supports all the required Riak backend capabilities. A number of furt

- A bucket size query, which requires traversal only of the Ledger and counts the keys and sums the total on-disk size of the objects within the bucket.

- A new SW count to operate in parallel to DW, where SW is the number of nodes required to have flushed to disk (not just written to buffer); with SW being configurable to 0 (no vnodes will need to sync this write), 1 (the PUT coordinator will sync, but no other vnodes) or all (all vnodes will be required to sync this write before providing a DW ack). This will allow the expensive disk sync operation to be constrained to the writes for which it is most needed.

- Support for a specific Riak tombstone tag where reaping of tombstones can be deferred (by many days) i.e. so that a 'sort-of-keep' deletion strategy can be followed that will eventually garbage collect without the need to hold pending full deletion state in memory.

@@ -42,10 +42,12 @@ There is some work required before LevelEd could be considered production ready:

The following Riak features have been implemented:

### Leveled Backend

Branch: [mas-leveleddb](https://github.com/martinsumner/riak_kv/tree/mas-leveleddb)

Branched-From: [Basho/develop](https://github.com/basho/riak_kv)

Description:

The leveled backend has been implemented with some basic manual functional tests. The backend has the following capabilities:

@@ -71,6 +73,8 @@ Note - the technique would work in leveldb and memory backends as well (and perh

Branch: [mas-leveled-getfsm](https://github.com/martinsumner/riak_kv/tree/mas-leveled-getfsm)

Branched-From: [mas-leveleddb](https://github.com/martinsumner/riak_kv/tree/mas-leveleddb)

Description:

In standard Riak the Riak node that receives a GET request starts a riak_kv_get_fsm to handle that request. This FSM goes through the following primary states:

@@ -95,3 +99,19 @@ So rather than doing three Key/Metadata/Body backend lookups for every request,

The feature will not at present work safely with legacy vclocks. This branch generally relies on vector clock comparison only for equality checking, and removes some of the relatively expensive whole body equality tests (either as a result of sets:from_list/1 or riak_object:equal/2), which are believed to be a legacy of issues with pre-dvv clocks.

In tests, the benefit of this may not be that significant, as the primary resources saved are disk and network; if these are not the resources under pressure, the gain will be limited. In tests bound by CPU rather than disk, only a 10% improvement has so far been measured with this feature.
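
The principle can be illustrated with a small sketch (this is not the riak_kv_get_fsm code, and the module, function and type names below are invented for illustration): gather HEAD responses - vclocks only - from r vnodes, and fetch a single full object body only when those clocks agree.

```erlang
%% Illustrative sketch only - invented names, not the riak_kv_get_fsm code.
%% Given the HEAD responses from r vnodes as {VnodeId, Vclock} pairs, decide
%% whether one body fetch is enough or whether bodies are needed for a merge.
-module(head_before_get_sketch).
-export([plan_body_fetch/1]).

-spec plan_body_fetch([{term(), term()}]) -> {get_one, term()} | get_all.
plan_body_fetch([{VnodeId, Vclock} | Rest]) ->
    case lists:all(fun({_V, C}) -> C =:= Vclock end, Rest) of
        true  -> {get_one, VnodeId};  %% clocks equal - one GET for the body will do
        false -> get_all              %% divergent clocks - fetch bodies to resolve
    end.
```
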
### PUT -> Using HEAD

Branch: [mas-leveled-putfsm](https://github.com/martinsumner/riak_kv/tree/mas-leveled-putfsm)

Branched-From: [mas-leveled-getfsm](https://github.com/martinsumner/riak_kv/tree/mas-leveled-getfsm)

Description:

The standard PUT process for Riak requires the PUT to be forwarded to a coordinating vnode first. The coordinating PUT process requires the object to be fetched from the local vnode only, and a new updated object to be created. The remaining n-1 vnodes are then sent the updated object as a PUT, and once w/dw/pw nodes have responded the PUT is acknowledged.

The other n-1 vnodes must also do a local GET before the vnode PUT (so as not to erase a more up-to-date value that may not have been present at the coordinating vnode).

This branch changes the behaviour slightly at the non-coordinating vnodes. These vnodes will now try a HEAD request before the local PUT (not a GET request), and if the HEAD response returns a vclock which is <strong>dominated</strong> by the vclock of the updated PUT, the vnode will not attempt to fetch the whole object for the syntactic merge.

This should save two object fetches (where n=3) in most circumstances.
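
A toy model of that decision at a non-coordinating vnode might look as follows (an illustration only, using simple [{Actor, Count}] lists rather than riak_object's actual vclock types):

```erlang
%% Illustrative sketch only - a toy model of the HEAD-before-PUT check on the
%% mas-leveled-putfsm branch. Vclocks are modelled as [{Actor, Count}] lists.
-module(head_before_put_sketch).
-export([requires_local_get/2]).

%% A full local GET (for a syntactic merge) is only required when the incoming
%% PUT's clock does not dominate the clock returned by the local HEAD request.
requires_local_get(LocalClock, IncomingClock) ->
    not dominates(IncomingClock, LocalClock).

%% ClockA dominates ClockB if every actor's count in B is matched or exceeded in A.
dominates(ClockA, ClockB) ->
    lists:all(
        fun({Actor, CountB}) ->
            case lists:keyfind(Actor, 1, ClockA) of
                {Actor, CountA} -> CountA >= CountB;
                false -> false
            end
        end,
        ClockB).
```
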
@@ -66,9 +66,9 @@ The evolution of leveledb in Riak, from the original Google-provided store to th

The original leveldb considered in part the hardware economics of the phone, where there are clear constraints around CPU usage - due to both form-factor and battery life - and where disk space may be at a greater premium than disk IOPS. Some of the evolution of eleveldb is down to the Riak-specific problem of needing to run multiple stores on a single server, where even load distribution may lead to a synchronisation of activity. Much of the evolution is also about how to make better use of the continuous availability of CPU resource, in the face of the relative scarcity of disk resource. Changes such as overlapping files at level 1, hot threads, compression improvements etc. all move eleveldb in the direction of being easier on disk at the cost of CPU; and the hardware economics of servers would indicate this is a wise choice.

### Planning for Leveled

The primary design differentiation between Leveled and LevelDB is the separation of the key store (known as the Ledger in Leveled) and the value store (known as the Journal). The Journal is like a continuous extension of the nursery log within LevelDB, only with a gradual evolution into [CDB files](https://en.wikipedia.org/wiki/Cdb_(software)) so that file offset pointers are not required to exist permanently in memory. The Ledger is a merge tree structure, with values substituted with metadata and a sequence number - where the sequence number can be used to find the value in the Journal.
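
As a toy illustration of this split (not leveled's actual data structures, which persist the Ledger as a merge tree and the Journal as CDB files rather than in-memory maps), a HEAD can be answered from the Ledger alone, while a GET follows the sequence number into the Journal:

```erlang
%% Toy illustration of the Ledger/Journal split - not leveled's real structures.
%% Ledger maps Key => {Metadata, SQN}; Journal maps SQN => Value.
-module(ledger_journal_sketch).
-export([new/0, put_value/4, head/2, get_value/2]).

new() -> {#{}, #{}, 0}.

put_value({Ledger, Journal, SQN0}, Key, Metadata, Value) ->
    SQN = SQN0 + 1,
    {Ledger#{Key => {Metadata, SQN}}, Journal#{SQN => Value}, SQN}.

%% HEAD touches only the Ledger (keys and metadata), never the Journal.
head({Ledger, _Journal, _SQN}, Key) ->
    maps:find(Key, Ledger).

%% GET uses the Ledger to find the sequence number, then reads the Journal.
get_value({Ledger, Journal, _SQN}, Key) ->
    case maps:find(Key, Ledger) of
        {ok, {_Metadata, SQN}} -> maps:find(SQN, Journal);
        error -> not_found
    end.
```
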
This is not an original idea; the LSM-Tree paper specifically talked about the trade-offs of placing identifiers rather than values in the merge tree:

@@ -84,7 +84,7 @@ So the hypothesis that separating Keys and Values may be optimal for LSM-Trees i

## Being Operator Friendly

The LSM-Tree paper focuses on hardware trade-offs in database design. Leveled is focused on the job of being a backend to a Riak database, and the Riak database is opinionated on the trade-off between developer and operator productivity. Running a Riak database imposes constraints and demands on developers - there are things the developer needs to think hard about: living without transactions, considering the resolution of siblings, manual modelling for query optimisation.

However, in return for this pain there is great reward, a reward which is gifted to the operators of the service. Riak clusters are reliable and predictable, and operational processes are slim and straightforward - preparation for managing a Riak cluster in production needn't go much beyond rehearsing the magical cure-alls of the node stop/start and node join/leave processes. At the NHS, where we have more than 50 Riak nodes in 24x365 business-critical operations, it is not untypical to go more than 28 days without anyone logging on to a database node. This is a relief for those of us who have previously lived in the world of databases with endless configuration parameters to test or blame for issues, where you always seem to be the unlucky one who suffers the outages "never seen in any other customer", where the databases come with ever more complicated infrastructure dependencies, and where DBAs need to be constantly at hand to analyse reports, kill rogue queries and re-run the query optimiser as an answer to the latest 'blip' in performance.

docs/VOLUME.md

@@ -4,149 +4,124 @@

Initial volume tests have been [based on the standard basho_bench eleveldb test](../test/volume/single_node/examples), running multiple stores in parallel on the same node and subjecting them to concurrent pressure.

This showed a [relatively positive performance for leveled](VOLUME_PRERIAK.md) for both population and load. This also showed that although the leveled throughput was relatively stable, it was still subject to fluctuations related to CPU constraints - especially as compaction of the ledger was a CPU-intensive activity. Prior to moving on to full Riak testing, a number of changes were then made to leveled to reduce the CPU load during these merge events.

## Initial Riak Cluster Tests

Testing in a Riak cluster has been based on like-for-like comparisons between leveldb and leveled - except that leveled was run within a Riak [modified](FUTURE.md) to use HEAD not GET requests when HEAD requests are sufficient.

The initial testing was based on simple GETs and updates, with 5 GETs for every update.

### Basic Configuration - Initial Tests

The configuration consistent across all tests is:

- A 5 node cluster
- Using i2.2xlarge EC2 or d2.2xlarge nodes with mirrored/RAID10 drives (for data partition only)
- deadline scheduler, transparent huge pages disabled, ext4 partition
- A 64 partition ring-size
- AAE set to passive
- A pareto distribution of requests with a keyspace of 200M keys
- 5 GETs for each UPDATE

### Mid-Size Object, SSDs, Sync-On-Write

This test has the following specific characteristics:

- An 8KB value size (based on crypto:rand_bytes/1 - so cannot be effectively compressed)
- 60 concurrent basho_bench workers running at 'max'
- i2.2xlarge instances
- allow_mult=false, lww=false

Comparison charts for this test:

Riak + leveled             | Riak + eleveldb
:-------------------------:|:-------------------------:
 | 

### Mid-Size Object, SSDs, No Sync-On-Write

This test has the following specific characteristics:

- An 8KB value size (based on crypto:rand_bytes/1 - so cannot be effectively compressed)
- 100 concurrent basho_bench workers running at 'max'
- i2.2xlarge instances
- allow_mult=false, lww=false

Comparison charts for this test:

Riak + leveled             | Riak + eleveldb
:-------------------------:|:-------------------------:
 | 

### Mid-Size Object, HDDs, No Sync-On-Write

This test has the following specific characteristics:

- An 8KB value size (based on crypto:rand_bytes/1 - so cannot be effectively compressed)
- 50 concurrent basho_bench workers running at 'max'
- d2.2xlarge instances
- allow_mult=false, lww=false

Comparison charts for this test:

Riak + leveled             | Riak + eleveldb
:-------------------------:|:-------------------------:
 | 

Note that there is a clear inflexion point when throughput starts to drop sharply at about the hour mark into the test. This is the stage when the volume of data has begun to exceed the volume supportable in cache, and so disk activity begins to be required for GET operations with increasing frequency.

### Half-Size Object, SSDs, No Sync-On-Write

to be completed

### Double-Size Object, SSDs, No Sync-On-Write

to be completed

### Lies, damned lies etc

The first thing to note about the test is the impact of the pareto distribution and the start from an empty store on what is actually being tested. At the start of the test there is a 0% chance of a GET request actually finding an object. Normally, it will be 3 hours into the test before a GET request will have a 50% chance of finding an object.



Both leveled and leveldb are optimised for finding non-presence through the use of bloom filters, so the comparison is not unduly influenced by this. However, the workload at the end of the test is both more realistic (in that objects are found), and harder if the previous throughput had been greater (in that more objects are found).

So it is better to focus on the results at the tail of the tests, as these are a more genuine reflection of behaviour against the advertised test parameters.

Test Description                  | Hardware     | Duration | Avg TPS   | Delta (Overall)  | Delta (Last Hour)
:---------------------------------|:-------------|:--------:|----------:|-----------------:|-------------------:
8KB value, 60 workers, sync       | 5 x i2.2x    | 4 hr     | 12,679.91 | <b>+ 70.81%</b>  | <b>+ 63.99%</b>
8KB value, 100 workers, no_sync   | 5 x i2.2x    | 6 hr     | 14,100.19 | <b>+ 16.15%</b>  | <b>+ 35.92%</b>
8KB value, 50 workers, no_sync    | 5 x d2.2x    | 6 hr     | 10,400.29 | <b>+ 8.37%</b>   | <b>+ 23.51%</b>

Leveled, like bitcask, will defer compaction work until a designated compaction window, and these tests were run outside of that compaction window. So although the throughput of leveldb is lower, it has no deferred work at the end of the test. Future testing work is scheduled to examine leveled throughput during a compaction window.

As a general rule, looking at the resource utilisation during the tests, the following conclusions can be drawn:

- When unconstrained by disk I/O limits, leveldb can achieve a greater throughput rate than leveled.
- During these tests leveldb is frequently constrained by disk I/O limits, and the frequency with which it is constrained increases the longer the test is run for.
- leveled is almost always constrained by CPU, or by the limits imposed by response latency and the number of concurrent workers.
- Write amplification is the primary delta in disk contention between leveldb and leveled - as leveldb is amplifying the writing of values, not just keys, it creates a significantly larger 'background noise' of disk activity, and that noise is sufficiently variable that it invokes response-time volatility even when r and w values are less than n.
- leveled has substantially lower tail latency, especially on PUTs.
- leveled throughput would be increased by adding concurrent workers and increasing the available CPU.
- leveldb throughput would be increased by having improved disk I/O.

## Riak Cluster Test - Phase 2

to be completed ..

Testing with changed hashtree logic in Riak so key/clock scan is effective

## Riak Cluster Test - Phase 3

to be completed ..

Testing during a journal compaction window

## Riak Cluster Test - Phase 4

to be completed ..