From 71d60005f3adb6f5274a14723d92d4acdd7558fb Mon Sep 17 00:00:00 2001
From: martinsumner
Date: Fri, 10 Feb 2017 11:09:32 +0000
Subject: [PATCH] Some more why answers

These potential optimisations occupied a lot of time at the start, and
ultimately the desire to simplify rather than optimise won through.
---
 docs/WHY.md | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/docs/WHY.md b/docs/WHY.md
index 8add6cc..b00bd19 100644
--- a/docs/WHY.md
+++ b/docs/WHY.md
@@ -38,7 +38,7 @@ From the start I decided that fadvise would be my friend, in part as:
 
 - It was easier to depend on the operating system to understand and handle LRU management.
 
-- I'm expecting various persisted elements (not just values, but also key ranges) to benefit from compression, so want to maximise the volume of data holdable in memory (by caching the compressed view in the page cache).
+- I'm expecting various persisted elements (not just values, but also key ranges) to benefit from compression, so want to maximise the volume of data which can be held in memory (by caching the compressed view in the page cache).
 
 - I expect to reduce the page-cache polluting events that can occur in Riak by reducing object scanning.
 
@@ -46,6 +46,24 @@ Ultimately though, sophisticated memory management is hard, and beyond my capabi
 
 The design may make some caching strategies relatively easy to implement in the future though. Each file process has its own LoopData, and to that LoopData independent caches can be added. This is currently used for caching bloom filters and hash index tables, but could be used in a more flexible way.
 
+## Why do the Sequence Numbers not wrap?
+
+Each update increments the sequence number, but the sequence number does not wrap around at any stage; it will grow forever. The reasoning behind not supporting some sort of wraparound and vacuuming of the sequence number is:
+
+- There is no optimisation based on fitting into a specific size, so no immediate performance benefit from wrapping (Erlang integers are arbitrary-precision, and even a fixed 64-bit counter incremented 100,000 times per second would take over five million years to overflow);
+- Auto-vacuuming is hard to achieve, especially without creating a novel and unexpected operational event that is hard to capacity plan for;
+- The standard Riak node transfer feature will reset the sequence number in the replacement vnode - providing a natural way to address any unexpected issue with the growth of this integer.
+
+## Why not store pointers in the Ledger?
+
+The Ledger value contains the sequence number, which is necessary to look up the value in the Journal. However, this lookup requires a tree to be walked to find the right Journal file, and then a two-stage hashtree lookup to find the pointer within that file. The location of the value is known at the time the Ledger entry is written - so why not include a pointer at that stage? The Journal would then be Bitcask-like, relying on external pointers.
+
+The issue with this mechanism would be the added complexity of Journal compaction. When the Journal is compacted it would be necessary to make a series of Ledger updates to reflect the changes in pointers - and this would require some sort of locking strategy to prevent a race between an external value update and an update to the pointer.
+
+The [WiscKey store](https://www.usenix.org/system/files/conference/fast16/fast16-papers-lu.pdf) achieves this, without being explicit about whether it blocks updates to avoid the race.
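+
+To make the trade-off concrete, below is a minimal sketch of the two possible fetch paths. The function names and the manifest shape here are illustrative assumptions, not the actual leveled API:
+
+```erlang
+%% Chosen approach: the Ledger yields only a sequence number, so a
+%% Journal manifest (assumed here to be a list of {StartSQN, File}
+%% pairs, sorted with the highest StartSQN first) must be walked to
+%% find the file covering that SQN, and then the value looked up
+%% within the file - journal_lookup/2 stands in for the two-stage
+%% hashtree lookup.
+fetch_value(Key, SQN, [{StartSQN, File}|_Rest]) when SQN >= StartSQN ->
+    journal_lookup(File, {SQN, Key});
+fetch_value(Key, SQN, [_Entry|Rest]) ->
+    fetch_value(Key, SQN, Rest).
+
+%% Bitcask-like alternative (not chosen): the Ledger holds a direct
+%% {FileName, Position, Length} pointer, making a fetch one positioned
+%% read - but compaction moves values, so every such pointer in the
+%% Ledger would need rewriting under some locking scheme.
+fetch_value_by_pointer({FileName, Position, Length}) ->
+    {ok, IO} = file:open(FileName, [read, raw, binary]),
+    {ok, Value} = file:pread(IO, Position, Length),
+    ok = file:close(IO),
+    Value.
+```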
+
+The trade-off chosen is that the operational simplicity of decoupling references between the stores is worth the cost of the additional lookup, especially as Riak changes will reduce the relative volume of value fetches (by comparison to metadata/HEAD fetches).
+
 ## Why make this backwards compatible with OTP16?
 
 Yes why, why do I have to do this?