From e81a53b539f40f18298604f9d0e3a3533143f96d Mon Sep 17 00:00:00 2001
From: Martin Sumner <martin.sumner@adaptip.co.uk>
Date: Thu, 28 Sep 2017 20:04:21 +0100
Subject: [PATCH] Initial draft of option comparison

n HEADs and 1 GET

or

1 GET and n-1 HEADs
---
 docs/FUTURE.md | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/docs/FUTURE.md b/docs/FUTURE.md
index 301afc3..e06b2c8 100644
--- a/docs/FUTURE.md
+++ b/docs/FUTURE.md
@@ -87,15 +87,24 @@ The alternative FSM in this branch makes the following changes
 - recall execute asking for just the necessary GET requests
 - in waiting_vnode_r the second time update HEAD responses to GET responses, and recalculate the winner (including body) when all vnodes which have been sent a GET request have responded
 
-So rather than doing three Key/Metadata/Body backend lookups for every request, normally (assuming no siblings) Riak now requires three Key/Metadata and one Key/Metadata/Body lookup.  For larger bodies, this then avoids the additional disk activity and network load associated with fetching and transmitting the body three times for every request.  There are some other side effects:
-
-- It is more likely to detect out-of-date objects in slow nodes (as the n-r responses may still be received and added to the result set when waiting for the GET response in the second loop).
-- The database is in-theory much less likely to have [TCP Incast](http://www.snookles.com/slf-blog/2012/01/05/tcp-incast-what-is-it/) issues, hence reducing the cost of network infrastructure, and reducing the risk of running Riak in shared infrastructure environments.
+So rather than doing three Key/Metadata/Body backend lookups for every request, normally (assuming no siblings) Riak now requires three Key/Metadata and one Key/Metadata/Body lookup.  For larger bodies, this then avoids the additional disk activity and network load associated with fetching and transmitting the body three times for every request.  As another positive side effect the database is in-theory much less likely to have [TCP Incast](http://www.snookles.com/slf-blog/2012/01/05/tcp-incast-what-is-it/) issues, hence reducing the cost of network infrastructure, and reducing the risk of running Riak in shared infrastructure environments.
 
 The feature will not at present work safely with legacy vclocks. This branch generally relies on vector clock comparison only for equality checking, and removes some of the relatively expensive whole body equality tests (either as a result of set:from_list/1 or riak_object:equal/2), which are believed to be a legacy of issues with pre-dvv clocks.
 
 In tests, the benefit of this may not be that significant - as the primary resource saved is disk/network, and so if these are not the resources under pressure, the gain may not be significant.  In tests bound by CPU not disks, only a 10% improvement has so far been measured with this feature.
 
+#### 1 GET n-1 HEAD
+
+A potential alternative to perfoming n HEADS and then 1 GET, would be for the FSM to make 1 GET request and in parallel make n-1 HEAD requests.  This would, in an under-utilised cluster, require less resource and is likely to be lower latency.  Perhaps the 1 GET request vnode could be selected in a similar style to the PUT FSM put coordinator by starting the FSM on a node in the preflist and using the local vnode as the GET vnode, and the remote vnodes would be chosen as the HEAD request nodes for clock comparison.
+
+The primary reason why this approach has not been chosen, and the n HEADS followed by 1 GET mode has been preferred is to do with variable vnode mailbox lengths.  
+
+When a cluster is under heavy pressure, especially when a cluster has been expanded so that there are a low number of vnodes per node, vnode mailbox sizes can vary and some vnodes may go into overload status (as the mailbox is not unbounded).  At this stage the client must back-off, and the cluster must run at the pace of the slowest vnode.  Ideally, if a vnode is slow, it should be given less work.  This is a positive advantage of the n HEAD requests followed by 1 GET, the first responder is the one elected to perform the GET, and the slower responders miss out on that workload.  This means that naturally slower vnodes (such as those with longer mailbox queues), are given less work by avoiding the expensive GET requests, and the overload scenario is likely to be avoided.
+
+The issue with performing 1 GET and n-1 HEAD requests, what if the slow vnode (say one with a mailbox 2,000 long) is selected for the GET request (assuming the GET vnode now has to be selected at random, as there has been no test HEAD message to calibrate which vnode to use).  Once the n-1 HEAD requests have completed, how long should the FSM wait for the response to the GET request, especially as the GET request may be delayed naturally due to the value being large rather than due to the vnode being slow.  If the FSM times out aggresively, then larger object requests are more likely to be made more than once.  If the timeout is loose, then there will be many delays caused by one slow vnode, even when that vnode is not in the overload state.  With the n HEAD 1 GET approach we have evidence that the vnode chosen for the GET is active and fast, and so the FSM can wait until the FSM timeout (at risk of failing a request when the fialure occurrs between the HEAD and the GET request).  The 1 GET and n-1 HEAD requests doesn't avoid the slow vnode problem, and required seeming unresolvable reasoning about timeouts if the chosen GET node does not respond quickly. 
+
+During transition, there will be a state where some vnodes support HEAD requests, and others still use eleveldb and so do not.  In this case with n HEAD requests, some vnodes will respond with the object, as the vnode will revert back to a GET request if the "head" capability is not present in the backend.  If one of the quorum HEAD responses is an object, and there is no sibling detected, then the GET request will be avoided.
+
 ### PUT -> Using HEAD
 
 Branch: [mas-leveled-putfsm](https://github.com/martinsumner/riak_kv/tree/mas-leveled-putfsm)