Update thoughts on concurrency management

After reviewing the code it looks easier to have a separate pool of vnode
workers to manage concurrency, rather than trying to alter the coverage
FSM itself. This will make it easier to adjust for different backend
capabilities (i.e. worst case it would fall back to unthrottled).
martinsumner 2017-07-12 14:50:38 +01:00
parent 2adf60e974
commit cb5f09496f

@@ -259,7 +259,7 @@ This likely represent gaps in current understanding, rather than flaws in the ap
Some notes on re-using this alternative anti-entropy mechanism within Riak:
* There is divergence between Leveled and LevelDB with regard to how async folds are implemented. Within LevelDB requesting an async fold returns a folder function that will take a snapshot when it is called. Within Leveled the option exists to take the snapshot before returning the folder function, so that calling the folder function will work on a snapshot of the store taken when the folder was requested. This difference caused issues when testing with riak_kv_sweeper, as the scheduling in sweeper meant that folds would be requested, and left on a queue for long enough to be timed out by the time they were called. The quick fix for riak_kv_sweeper testing was to make the folder snapshot behaviour in Leveled consistent with LevelDB. However, the original behaviour opens up some interesting possibilities for AAE implementation, in that a coverage set of vnodes could be snapshotted at a point in time, but not all folds need to be run concurrently to make the result consistent to that point in time. This would allow folds to be directly throttled by the coverage process so that only one fold was being run on each node at once, without opening up a time-gap between snapshots that would increase the number of false repairs.
* There is divergence between Leveled and LevelDB with regard to how async folds are implemented. Within LevelDB requesting an async fold returns a folder function that will take a snapshot when it is called. Within Leveled the option exists to take the snapshot before returning the folder function, so that calling the folder function will work on a snapshot of the store taken when the folder was requested. This difference caused issues when testing with riak_kv_sweeper, as the scheduling in sweeper meant that folds would be requested, and left on a queue for long enough to be timed out by the time they were called. The quick fix for riak_kv_sweeper testing was to make the folder snapshot behaviour in Leveled consistent with LevelDB. However, the original behaviour opens up some interesting possibilities for AAE implementation, in that a coverage set of vnodes could be snapshotted at a point in time, but not all folds need to be run concurrently to make the result consistent to that point in time. This would allow folds to be directly throttled during the coverage process to manage the number of folds running on each node at once, without opening up a time-gap between snapshots that would increase the number of false repairs.
- It may be possible to make the leveldb behaviour match leveled here, so that the snapshot is taken before the folder function is returned. The fold function currently both sets up the iterator and performs the fold, and perhaps these could be separated such that the iterator is set up prior to the fold function being returned:
@@ -268,7 +268,9 @@ Some notes on re-using this alternative anti-entropy mechanism within Riak:
```
{ok, Itr} = iterator(Ref, Opts, keys_only),
do_fold(Itr, Fun, Acc0, Opts).
```
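A rough sketch of that separation (the wrapper name `async_fold_keys` is illustrative, not the current eleveldb API): the iterator - and therefore the snapshot - is established up front, and the returned fun only performs the fold when the worker eventually calls it.

```
%% Hypothetical sketch only: take the iterator (and so the snapshot) eagerly,
%% then hand back a fun that does the deferred fold work when called.
async_fold_keys(Ref, Fun, Acc0, Opts) ->
    {ok, Itr} = iterator(Ref, Opts, keys_only),
    fun() -> do_fold(Itr, Fun, Acc0, Opts) end.
```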
- The potential would then exist for a `riak_core_rollingcoverage_fsm` as a variation on `riak_core_coverage_fsm`. Whereas `riak_core_coverage_fsm` makes a coverage request and then sits in a `waiting_results` state until all vnodes are done, the rollingcoverage version may have async folders returned to it, and then roll over each folder in turn (see the sketch below). So all the folds will be run on snapshots taken close to the same point in time, but only one fold is running at a time, hence minimising the impact on the cluster.
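A minimal sketch of that rolling step, assuming each returned folder fun already holds its own snapshot (the function below is purely illustrative, not existing riak_core code):

```
%% Hypothetical sketch: run the vnodes' folder funs one at a time. Each fun
%% already holds its own snapshot, so serialising the folds does not move the
%% point in time that each vnode's results reflect.
roll_folders(Folders) ->
    roll_folders(Folders, []).

roll_folders([], Results) ->
    lists:reverse(Results);
roll_folders([Folder | Rest], Results) ->
    %% only one fold is in flight; the next starts when this one returns
    roll_folders(Rest, [Folder() | Results]).
```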
- Likewise with bitcask, folds are currently async with the snapshot effectively taken inside the returned async folder function (for bitcask the snapshot means opening a new bitcask store in read-only mode). The snapshot could in principle be moved outside of the async part but, unlike with leveldb and leveled snapshots, this is a relatively expensive operation - so it would block the main bitcask process in an unhealthy way. Finding a simple way of snapshotting prior to the fold and outside of the async process would therefore require more work in Bitcask (illustrated below).
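For illustration only (this is not the actual riak_kv or bitcask fold code), the current shape can be thought of as below - the expensive read-only open is the snapshot, and it is deferred into the async fun so that it runs in a worker rather than blocking the main process:

```
%% Hypothetical illustration: the read-only open is itself the costly
%% "snapshot" step, so it is kept inside the async fun and runs in the worker.
bitcask_folder(Dirname, FoldFun, Acc0) ->
    fun() ->
        Ref = bitcask:open(Dirname),   % opened without read_write, i.e. read-only
        try
            bitcask:fold(Ref, FoldFun, Acc0)
        after
            bitcask:close(Ref)
        end
    end.
```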
- riak_core supports vnode_worker_pools (currently only one) and riak_kv sets up a pool for folds. If riak_core were to be changed to support more than one pool, a second pool could be set up for snapped folds (i.e. where the response is {snap_async, Work, From, NewModState} as opposed to [async](https://github.com/basho/riak_core/blob/2.1.8/src/riak_core_vnode.erl#L358-#L362), the second vnode_worker_pool would be asked to fulfil this work - sketched below). The second pool could have a more constrained number of concurrent workers - so these large folds could have their concurrency throttled, without a timing impact on the consistency of the results across vnodes.
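A very rough sketch of how the dispatch might distinguish the two responses (the record fields, function name and `snap_async` handling are assumptions about a possible change, not current riak_core_vnode code):

```
%% Hypothetical sketch: route snap_async work to a second, smaller pool.
-record(state, {modstate, pool_pid, snap_pool_pid}).

handle_fold_response({async, Work, From, NewModState},
                     State = #state{pool_pid = Pool}) ->
    riak_core_vnode_worker_pool:handle_work(Pool, Work, From),
    State#state{modstate = NewModState};
handle_fold_response({snap_async, Work, From, NewModState},
                     State = #state{snap_pool_pid = SnapPool}) ->
    %% the snapped-fold pool would be started with fewer workers, throttling
    %% these large folds without changing when each vnode's snapshot was taken
    riak_core_vnode_worker_pool:handle_work(SnapPool, Work, From),
    State#state{modstate = NewModState}.
```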
* In Leveled a special fold currently supports Tic-Tac tree generation for indexes, and another for objects. It may be better to support this by offering a more open capability to pass different fold functions and accumulators into index folds. This could be re-used for "reporting indexes", where we want to count terms of different types rather than return all those terms via an accumulating list - e.g. an index may have a bitmap-style part, and the function would apply a wildcard mask to the bitmap and count the number of hits against each possible output, as in the sketch below.
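As a hedged illustration of such a caller-supplied fold (the entry shape `{Term, Key}` and the standalone function are assumptions, not the Leveled fold API):

```
%% Hypothetical sketch: count index entries whose bitmap-style term has any
%% bit in common with a wildcard mask, rather than accumulating the terms.
count_mask_hits(IndexEntries, Mask) ->
    lists:foldl(
        fun({Term, _Key}, Count) when (Term band Mask) =/= 0 -> Count + 1;
           ({_Term, _Key}, Count) -> Count
        end,
        0,
        IndexEntries).
```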