diff --git a/README.md b/README.md
index 3854df1..e5ba1a3 100644
--- a/README.md
+++ b/README.md
@@ -6,8 +6,8 @@ Leveled is a work-in-progress prototype of a simple Key-Value store based
- Optimised for workloads with larger values (e.g. > 4KB).
-- Explicitly supports HEAD requests in addition to GET requests.
- - Splits the storage of value between keys/metadata and body,
+- Explicitly supports HEAD requests in addition to GET requests.
+ - Splits the storage of each value between keys/metadata and body,
- Stores keys/metadata in a merge tree and the full object in a journal of [CDB files](https://en.wikipedia.org/wiki/Cdb_(software))
- allowing for HEAD requests which have lower overheads than GET requests, and
- queries which traverse keys/metadata to be supported with fewer side effects on the page cache.
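+
+The intended effect is that a HEAD request can be answered from the merge tree alone, without touching the journal. As a minimal sketch, assuming a bookie API of roughly this shape (the function names, arities and options here are illustrative, not a confirmed leveled API):
+
+```erlang
+-module(head_sketch).
+-export([head_then_get/3]).
+
+%% HEAD touches only the ledger (keys/metadata); GET must also
+%% fetch the value body from the CDB journal.
+head_then_get(RootPath, Bucket, Key) ->
+    {ok, Bookie} = leveled_bookie:book_start([{root_path, RootPath}]),
+    %% Cheap: metadata (e.g. a version vector) from the merge tree.
+    {ok, Metadata} = leveled_bookie:book_head(Bookie, Bucket, Key),
+    %% More expensive: the full object, requiring a journal read.
+    {ok, Object} = leveled_bookie:book_get(Bookie, Bucket, Key),
+    {Metadata, Object}.
+```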
@@ -20,7 +20,7 @@ Leveled is a work-in-progress prototype of a simple Key-Value store based
- Written in Erlang as a message passing system between Actors.
-The store has been developed with a focus on being a potential backend to a Riak KV database, rather than as a generic store. It is intended to be a fully-featured backend - including support for secondary indexes, multiple fold types and auto-expiry of objects.
+The store has been developed with a focus on being a potential backend to a Riak KV database, rather than as a generic store. It is intended to be a fully-featured backend - including support for secondary indexes, multiple fold types and auto-expiry of objects.
An optimised version of Riak KV has been produced in parallel which will exploit the availability of HEAD requests (to access object metadata including version vectors), where a full GET is not required. This, along with reduced write amplification when compared to leveldb, is expected to offer significant improvement in the volume and predictability of throughput for workloads with larger (> 4KB) object sizes, as well as reduced tail latency.
@@ -50,16 +50,17 @@ Test Description | Hardware | Duration |Avg TPS | TPS De
:---------------------------------|:-------------|:--------:|----------:|-----------------:|-------------------:
8KB value, 60 workers, sync | 5 x i2.2x | 4 hr | 12,679.91 | + 70.81% | + 63.99%
8KB value, 100 workers, no_sync | 5 x i2.2x | 6 hr | 14,100.19 | + 16.15% | + 35.92%
-8KB value, 50 workers, no_sync | 5 x d2.2x | 4 hr | 10,400.29 | + 8.37% | + 23.51%
+8KB value, 50 workers, no_sync | 5 x d2.2x | 4 hr | 10,400.29 | + 8.37% | + 23.51%
4KB value, 100 workers, no_sync | 5 x i2.2x | 6 hr | 14,993.95 | - 10.44% | - 4.48%
16KB value, 60 workers, no_sync | 5 x i2.2x | 6 hr | 11,167.44 | + 80.48% | + 113.55%
+8KB value, 80 workers, no_sync, 1% 2i | 5 x i2.2x | 6 hr | 9,855.96 | + 4.48% | + 22.36%
Tests generally show a 5:1 improvement in tail latency for leveled.
All tests have in common:
- Target Key volume - 200M with pareto distribution of load
-- 5 GETs per 1 update
+- 5 GETs per 1 update
- RAID 10 (software) drives
- allow_mult=false, lww=false
- modified riak (optimised for leveled) used in the leveled tests
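+
+For reference, the allow_mult/lww line above corresponds to bucket properties of this shape (using the standard Riak property names, assumed here to be applied to the test buckets):
+
+```erlang
+%% Sibling creation disabled, and last-write-wins resolution not
+%% used - so concurrent updates are resolved via vector clocks.
+[{allow_mult, false},
+ {last_write_wins, false}].
+```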
@@ -117,7 +118,7 @@ Running in Riak requires one of the branches of riak_kv referenced [here](docs/F
Building this from source as part of Riak will require a bit of fiddling around.
- clone and build [riak](https://github.com/martinsumner/riak/tree/mas-leveleddb)
-- cd deps
+- cd deps
- rm -rf riak_kv
- git clone -b mas-leveled-putfsm --single-branch https://github.com/martinsumner/riak_kv.git
- cd ..
diff --git a/docs/VOLUME.md b/docs/VOLUME.md
index 371dd57..a3dd92d 100644
--- a/docs/VOLUME.md
+++ b/docs/VOLUME.md
@@ -2,7 +2,7 @@
## Parallel Node Testing
-Initial volume tests have been [based on standard basho_bench eleveldb test](../test/volume/single_node/examples) to run multiple stores in parallel on the same node and and subjecting them to concurrent pressure.
+Initial volume tests have been [based on the standard basho_bench eleveldb test](../test/volume/single_node/examples), running multiple stores in parallel on the same node and subjecting them to concurrent pressure.
This showed a [relative positive performance for leveled](VOLUME_PRERIAK.md) for both population and load. This also showed that although the leveled throughput was relatively stable, it was still subject to fluctuations related to CPU constraints - especially as compaction of the ledger was a CPU-intensive activity. Prior to moving on to full Riak testing, a number of changes were then made to leveled to reduce the CPU load during these merge events.
@@ -24,7 +24,7 @@ The configuration consistent across all tests is:
- A pareto distribution of requests with a keyspace of 200M keys
- 5 GETs for each UPDATE
-### Mid-Size Object, SSDs, Sync-On-Write
+### Mid-Size Object, SSDs, Sync-On-Write
This test has the following specific characteristics
@@ -40,7 +40,7 @@ Riak + leveled | Riak + eleveldb
:-------------------------:|:-------------------------:
 | 
-### Mid-Size Object, SSDs, No Sync-On-Write
+### Mid-Size Object, SSDs, No Sync-On-Write
This test has the following specific characteristics
@@ -56,7 +56,7 @@ Riak + leveled | Riak + eleveldb
:-------------------------:|:-------------------------:
 | 
-### Mid-Size Object, HDDs, No Sync-On-Write
+### Mid-Size Object, HDDs, No Sync-On-Write
This test has the following specific characteristics
@@ -75,7 +75,7 @@ Riak + leveled | Riak + eleveldb
Note that there is a clear inflexion point when throughput starts to drop sharply at about the hour mark into the test.
This is the stage when the volume of data has begun to exceed the volume supportable in cache, and so disk activity begins to be required for GET operations with increasing frequency.
-### Half-Size Object, SSDs, No Sync-On-Write
+### Half-Size Object, SSDs, No Sync-On-Write
This test has the following specific characteristics
@@ -92,7 +92,7 @@ Riak + leveled | Riak + eleveldb
 | 
-### Double-Size Object, SSDs, No Sync-On-Write
+### Double-Size Object, SSDs, No Sync-On-Write
This test has the following specific characteristics
@@ -124,7 +124,7 @@ Test Description | Hardware | Duration |Avg TPS | Delta
:---------------------------------|:-------------|:--------:|----------:|-----------------:|-------------------:
8KB value, 60 workers, sync | 5 x i2.2x | 4 hr | 12,679.91 | + 70.81% | + 63.99%
8KB value, 100 workers, no_sync | 5 x i2.2x | 6 hr | 14,100.19 | + 16.15% | + 35.92%
-8KB value, 50 workers, no_sync | 5 x d2.2x | 4 hr | 10,400.29 | + 8.37% | + 23.51%
+8KB value, 50 workers, no_sync | 5 x d2.2x | 4 hr | 10,400.29 | + 8.37% | + 23.51%
4KB value, 100 workers, no_sync | 5 x i2.2x | 6 hr | 14,993.95 | - 10.44% | - 4.48%
16KB value, 60 workers, no_sync | 5 x i2.2x | 6 hr | 11,167.44 | + 80.48% | + 113.55%
@@ -138,7 +138,7 @@ As a general rule, looking at the resource utilisation during the tests, the fol
- Write amplification is the primary delta in disk contention between leveldb and leveled - as leveldb is amplifying the writing of values, not just keys, it creates a significantly larger 'background noise' of disk activity, and that noise is sufficiently variable that it invokes response-time volatility even when r and w values are less than n.
- leveled has substantially lower tail latency, especially on PUTs.
- leveled throughput would be increased by adding concurrent workers, and increasing the available CPU.
-- leveldb throughput would be increased by having improved disk i/o.
+- leveldb throughput would be increased by having improved disk i/o.
## Riak Cluster Test - Phase 2
@@ -155,9 +155,46 @@ Testing during a journal compaction window
## Riak Cluster Test - Phase 4
-to be completed ..
+Testing with secondary indexes provides some new challenges:
+
+- Including index entries on updates accelerates the rate of compaction of the merge tree, as each object update now feeds multiple changes into the merge tree;
+- The 2i query needs an efficient snapshot to be taken of the store, and multiple such snapshots to exist in parallel;
+- The coverage nature of the query means that the query operation is now as slow as the slowest of the covered vnodes (22 vnodes with a ring size of 64; see the arithmetic check below), rather than as slow as the second slowest of three vnodes. Coverage queries are therefore exposed to local resource pressure on any single vnode, whereas standard GET flows can work around such pressure.
+
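+Assuming the default n_val of 3, the 22 vnodes follow directly from the ring size: a coverage plan needs roughly one vnode from each preflist, i.e. ceil(RingSize / NVal). A minimal check of that arithmetic:
+
+```erlang
+%% Approximate vnode count for a coverage query: one vnode per
+%% preflist, i.e. RingSize / NVal rounded up.
+CoverageVnodes = fun(RingSize, NVal) ->
+    (RingSize + NVal - 1) div NVal
+end.
+CoverageVnodes(64, 3). %% => 22
+```
+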
+The secondary index test was built on a test which sent the following mix of requests (also expressed as an operation weighting after the list):
+
+- 100 GET requests;
+- 10 update requests (i.e. GET then PUT) for fixed-size 8KB objects;
+- 10 update requests (i.e. GET then PUT) for fixed-size 8KB objects with four index entries each (leading to 4 index changes for a new object, and 8 index changes for an update, as the old entries are removed and new ones added);
+- 1 2i range query on the "date of birth" index with return_terms enabled;
+- 1 2i range query on the "postcode index" with return_terms enabled.
+
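+Expressed as basho_bench-style operation weightings, that mix looks roughly like the following (the operation names are illustrative placeholders, not the actual driver keys used in the test):
+
+```erlang
+%% Relative operation weights for the 2i test cycle.
+{operations, [{get, 100},
+              {update, 10},
+              {update_with_2i, 10},
+              {query_2i_dob, 1},
+              {query_2i_postcode, 1}]}.
+```
+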
+The query load is relatively light compared to the GET/PUT load, in line with Basho recommendations (query throughput declines from 350 queries per second to 120 queries per second over the course of the test). The queries return at most O(1000) results towards the tail of the test, and O(1) results at the start of the test.
+
+Riak + leveled | Riak + leveldb
+:-------------------------:|:-------------------------:
+![](../test/volume/cluster_2i/output/summary_leveled_5n_80t_i2_nosync_2i.png) | ![](../test/volume/cluster_2i/output/summary_leveldb_5n_80t_i2_nosync_2i.png)
+
+The results are similar to those of previous tests. Although the test is run on infrastructure with optimised disk throughput (and with no flushing to disk on write from Riak, to minimise direct disk pressure from Riak), disk busyness rapidly becomes a constraining factor when running with leveldb - and the reaction to that constraint is volatility in throughput. Riak combined with leveldb is capable of greater throughput than Riak + leveled in short bursts; however, when throttled within the cluster by a node or nodes with busy disks, the reaction is extreme.
+
+The average throughputs hour by hour are:
+
+Hour |Riak + leveled | Riak + leveldb | Delta
+:--------|-------------------------:|-------------------------:|--------
+1 | 14,465.64 | 16,235.83 | - 10.90%
+2 | 10,600.16 | 11,813.14 | - 10.27%
+3 | 9,366.19 | 7,895.84 | + 18.63%
+4 | 8,605.89 | 7,225.25 | + 19.11%
+5 | 8,180.88 | 6,826.37 | + 19.84%
+6 | 7,916.19 | 6,469.73 | + 22.36%
+
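+The Delta column is the relative difference between the two throughput columns; for example, for hour 6:
+
+```erlang
+%% Delta = 100 * (Leveled - Leveldb) / Leveldb, as a percentage.
+Delta = fun(Leveled, Leveldb) -> 100 * (Leveled - Leveldb) / Leveldb end.
+Delta(7916.19, 6469.73). %% => 22.36 (i.e. + 22.36%)
+```
+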
+The point of inflection, where leveled-backed Riak begins to outperform leveldb-backed Riak, is related to the total size of the database growing beyond the size of the in-memory cache. The need to go to disk for reads as well as writes amplifies the disk busyness problem associated with write amplification.
+
+As before, increasing the available CPU will increase the potential throughput of a leveled-backed cluster, and increasing the disk IOPS rate will increase the potential throughput of a leveldb-backed cluster. Increasing object sizes will extend the throughput advantage of leveled, and decreasing sizes will improve the relative performance of leveldb. Increasing the size of the database relative to cache may also increase the throughput advantage of leveled - but this may not hold true for all work profiles (for example, those frequently accessing recently updated objects).
+
+In the 2i query test there is the same significant reduction in tail latency when comparing leveled with leveldb. For the 2i queries there is also a significant reduction in mean latency - a 57.96% reduction over the course of the test. However, this reduction is not a sign that leveled can resolve 2i queries faster than leveldb; it too is a product of tail latency. The minimum response time for a 2i query in leveldb drifts from 4ms to 14ms as the test progresses, whereas the minimum response time for Riak and leveled fluctuates from 5ms up to 20ms by the end of the test. Outside of resource pressure leveldb will give the faster response - but resource pressure is much more likely with leveldb.
-Testing for load including 2i queries
## Riak Cluster Test - Phase 5
diff --git a/test/volume/cluster_2i/output/summary_leveldb_5n_80t_i2_nosync_2i.png b/test/volume/cluster_2i/output/summary_leveldb_5n_80t_i2_nosync_2i.png
new file mode 100644
index 0000000..4b9ebf0
Binary files /dev/null and b/test/volume/cluster_2i/output/summary_leveldb_5n_80t_i2_nosync_2i.png differ
diff --git a/test/volume/cluster_2i/output/summary_leveled_5n_80t_i2_nosync_2i.png b/test/volume/cluster_2i/output/summary_leveled_5n_80t_i2_nosync_2i.png
new file mode 100644
index 0000000..a976c30
Binary files /dev/null and b/test/volume/cluster_2i/output/summary_leveled_5n_80t_i2_nosync_2i.png differ