From e55571697a3a094aeabeed8aa1f27ed0a16a912a Mon Sep 17 00:00:00 2001
From: martinsumner
Date: Thu, 2 Feb 2017 12:01:14 +0000
Subject: [PATCH] Extend the Intro

---
 docs/INTRO.md | 26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/docs/INTRO.md b/docs/INTRO.md
index 913b2f7..6175481 100644
--- a/docs/INTRO.md
+++ b/docs/INTRO.md
@@ -20,7 +20,7 @@ Consideration on design trade-offs for LevelEd should likewise start with a view

### Modern Hardware Costs

-Based on the experience of running Riak at scale in production for the NHS, what has been noticeable is that although on the servers we use around 50% of the cost goes on disk in almost all scenarios where the system is pushed to limits the limit first hit is in disk throughput.
+Based on the experience of running Riak at scale in production for the NHS, what has been noticeable is that although around 50% of the cost of the servers used goes on disk-related expenditure, in almost all scenarios where the system is pushed to its limits the first limit hit is disk throughput.

The purchase costs of disk though, do not accurately reflect the running costs of disks - because disks fail, and fail often. Also the hardware choices made by the NHS for the Spine programme, do not necessarily reflect the generic industry choices and costs.

@@ -50,13 +50,13 @@ By comparing these prices we can surmise that the relative costs are:

If a natural ratio in a database server is 1 CPU: 10GB RAM: 200GB disk - this would give a proportional cost of the disk of 80% for SSD and 25% for HDD.

-Compared to the figures at the time of the LSM-Tree paper actual the delta in the per-byte cost of memory and the per-byte costs of disk space has closed significantly, even when using the lowest cost disk options. This may reflect changes in the pace of technology advancement, or just the fact that maintenance cost associated with different failure rates is now more correctly priced.
+Compared to the figures at the time of the LSM-Tree paper, the actual delta between the per-byte cost of memory and the per-byte cost of disk space has closed significantly, even when using the lowest-cost disk options. This may reflect changes in the pace of technology advancement, or just the fact that the maintenance cost associated with different failure rates is now more correctly priced.

-The availability of SDDs is not a silver bullet to disk i/o problems when cost is considered, as although they eliminate the additional costs of random page access through the removal of the disk movement overhead (of about 6.5ms per movement), this benefit is at an order of magnitude difference in cost compared to spinning disks, and at a cost greater than half the price of DRAM.
+The availability of SSDs is not a silver bullet for disk I/O problems when cost is considered: although they eliminate the additional cost of random page access by removing the disk head movement overhead (of about 6.5ms per shift), this benefit comes at an order of magnitude difference in cost compared to spinning disks, and at a per-byte cost greater than half the price of DRAM.

SSDs have not taken the problem of managing the overheads of disk persistence away; they've simply added another dimension to the economic profiling problem. In physical on-premise server environments there is also commonly the cost of disk controllers. Disk controllers bend the economics of persistence through the presence of flash-backed write caches.
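Before moving on from hardware costs, a worked version of the arithmetic above may help. The following sketch is illustrative only: the unit prices are assumptions chosen to be consistent with the relative costs described in this section (SSD at half the per-byte price of DRAM, HDD roughly an order of magnitude cheaper again), not figures from an actual survey.

```python
# Illustrative only: assumed unit prices, normalised so DRAM costs 1.0 per GB.
RAM_PER_GB = 1.0
SSD_PER_GB = RAM_PER_GB / 2    # "greater than half the price of DRAM"
HDD_PER_GB = SSD_PER_GB / 12   # roughly an order of magnitude cheaper again
CPU_COST = 15.0                # assumed cost of one CPU in the same units


def disk_share(disk_per_gb, cpus=1, ram_gb=10, disk_gb=200):
    """Proportion of total server cost spent on disk, for a given ratio."""
    disk = disk_gb * disk_per_gb
    total = cpus * CPU_COST + ram_gb * RAM_PER_GB + disk
    return disk / total


# For the natural 1 CPU : 10GB RAM : 200GB disk ratio suggested above:
print(f"SSD disk share: {disk_share(SSD_PER_GB):.0%}")  # SSD disk share: 80%
print(f"HDD disk share: {disk_share(HDD_PER_GB):.0%}")  # HDD disk share: 25%
```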
However, disk controllers also fail - within the NHS environment disk controller failures are the second most common device failure after individual disks. Failures of disk controllers are also expensive to resolve, not being hot-pluggable like disks, and carrying greater risk of node data-loss due to either bad luck or bad process during the change. It is noticeable that EC2 does not have disk controllers and given their failure rate and cost of recovery, this appears to be a sensible trade-off.

-In building economical efficient systems which will depend on persistence, their is a clear advantage in building systems which perform well on traditional hard-disk drives. Given the high relative cost of SDDs, where a large volume of space is required, it is worthwhile to think about stages of storage where large volumes of space is required - can by design the database space be split into 'hot' and 'cold' spaces such that the majority of storage is in the cold space, but the majority of access is in the hot space.
+Making cost-driven decisions about storage design remains as relevant now as it was two decades ago when the LSM-Tree paper was published, especially as we can now directly see those costs reflected in hourly resource charges.

### eleveldb Evolution

@@ -64,6 +64,22 @@ The evolution of leveledb in Riak, from the original Google-provided store to th

The original leveledb considered in part the hardware economics of the phone, where there are clear constraints around CPU usage - due to both form-factor and battery life - and where disk space may be at a greater premium than disk IOPS.

Some of the evolution of eleveldb is down to the Riak-specific problem of needing to run multiple stores on a single server, where even load distribution may lead to a synchronisation of activity. Much of the evolution is also about how to make better use of the continuous availability of CPU resource, in the face of the relative scarcity of disk resource. Changes such as overlapping files at level 1, hot threads and compression improvements all move eleveldb in the direction of being easier on disk at the cost of CPU; and the hardware economics of servers would indicate this is a wise choice.

+### Planning for LevelEd
+
+The primary design differentiation between LevelEd and LevelDB is the separation of the key store (known as the Ledger in LevelEd) and the value store (known as the Journal). The Journal is like a continuous extension of the nursery log within LevelDB, only with a gradual evolution into [CDB files](https://en.wikipedia.org/wiki/Cdb_(software)) so that file offset pointers are not required to exist permanently in memory. The Ledger is a merge tree structure, with values replaced by metadata and a sequence number - the sequence number can then be used to find the value in the Journal.
+
+This is not an original idea; the LSM-Tree paper specifically talked about the trade-offs of placing identifiers rather than values in the merge tree:
+
+> To begin with, it should be clear that the LSM-tree entries could themselves contain records rather than RIDs pointing to records elsewhere on disk. This means that the records themselves can be clustered by their keyvalue. The cost for this is larger entries and a concomitant acceleration of the rate of insert R in bytes per second and therefore of cursor movement and total I/O rate H.
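As a minimal sketch of this split (hypothetical Python with invented names; LevelEd itself is an Erlang application and its real interface differs), the Ledger holds only keys, metadata and sequence numbers, while values live in an append-only Journal:

```python
class KeyValueSeparatedStore:
    """Toy model of the Ledger/Journal split, not the actual LevelEd design."""

    def __init__(self):
        self.journal = []  # append-only value store, standing in for CDB files
        self.ledger = {}   # key -> (metadata, sequence number), the merge tree

    def put(self, key, value, metadata):
        self.journal.append(value)           # the value goes to the Journal
        seqn = len(self.journal) - 1
        self.ledger[key] = (metadata, seqn)  # the Ledger keeps only a pointer

    def head(self, key):
        # Metadata-only reads are served from the (small, hot) Ledger alone
        metadata, _seqn = self.ledger[key]
        return metadata

    def get(self, key):
        # Full reads follow the sequence number out into the (cold) Journal
        metadata, seqn = self.ledger[key]
        return metadata, self.journal[seqn]
```

The point of the sketch is that a `head` request never touches the Journal, which is what allows the hot and cold parts of the database to be placed and cached differently.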
+
+The reasoning behind the use of this structure is an attempt to differentiate more clearly between a (small) hot database space (the Ledger) and a (much larger) cold database space (the non-current part of the Journal), so that, through use of the page cache or faster disk, the hot part of the database can be optimised for rapid access.
+
+In parallel to this work, there has also been work published on [WiscKey](https://www.usenix.org/system/files/conference/fast16/fast16-papers-lu.pdf), which explores precisely this trade-off.
+
+There is an additional optimisation that is directly relevant to Riak. Riak always fetches both Key and Value in a GET operation within a cluster, but theoretically, in many cases where the object has no siblings, this is not necessary. For example, when GETting an object to perform a PUT, only the vector clock and index specs are actually necessary if the object is not a CRDT. Also, when performing a Riak GET operation the value is fetched three times (once from each replica), even if there is no conflict between the values.
+
+So the hypothesis that separating Keys and Values may be optimal for LSM-Trees in general is potentially extendable for Riak, where in the majority of read requests there exists the potential to replace the GET of a value with a lower-cost HEAD request for just Key and Metadata.
+
## Being Operator Friendly

The LSM-Tree paper focuses on hardware trade-offs in database design. LevelEd is focused on the job of being a backend to a Riak database, and the Riak database is opinionated on the trade-off between developer and operator productivity. Running a Riak database imposes constraints and demands on developers - there are things the developer needs to think hard about: living without transactions, considering the resolution of siblings, manual modelling for query optimisation.

@@ -72,6 +88,8 @@ However, in return for this pain there is great reward, a reward which is gifted

Developments on Riak of the past few years, in particular the introduction of CRDTs, have made some limited progress in easing the developer headaches. No real progress has been made though in making things more operator friendly, and although operator sleep patterns are the primary beneficiary of a Riak installation, that does not mean to say that things cannot be improved.

+### Planning for LevelEd
+
The primary operator improvements sought are:

- Increased visibility of database contents. Riak lacks efficient answers to simple questions about the bucket names which have been defined and the size and space consumed by different buckets (the sketch below illustrates how the Ledger could help here).
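To illustrate the bullet above: with the Ledger/Journal split sketched earlier, such questions can in principle be answered by folding over keys and metadata alone, never reading a value from the Journal. A hypothetical sketch, assuming keys are (bucket, key) pairs and that each Ledger metadata entry carries a size field (an illustrative convention, not Riak's actual object format):

```python
from collections import defaultdict

def bucket_stats(store):
    """Answer 'which buckets exist, and how big are they?' from the Ledger only."""
    stats = defaultdict(lambda: {"keys": 0, "bytes": 0})
    for (bucket, _key), (metadata, _seqn) in store.ledger.items():
        stats[bucket]["keys"] += 1
        stats[bucket]["bytes"] += metadata["size"]
    return dict(stats)

store = KeyValueSeparatedStore()  # the toy store from the earlier sketch
store.put(("users", "alice"), b"...value...", {"size": 11})
print(bucket_stats(store))  # {'users': {'keys': 1, 'bytes': 11}}
```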