More word wrangling

martinsumner 2017-07-07 10:23:21 +01:00
parent 417687c004
commit d5a365b368

@ -76,7 +76,7 @@ The AAE process in production system commonly raises false positives (prompts re
### Proposed Leveled AAE
The first stage in considering an alternative approach to anti-entropy, was to question the necessity of having a dedicated AAE database that needs to reflect all key changes in the actual vnode store. A separate store can have features such as being sorted by segment ID that make that store easier to scan for rebuilds of the tree. By contrast, there are three primary costs with scanning over the primary database:
The first stage in considering an alternative approach to anti-entropy was to question the necessity of having a dedicated AAE database that needs to reflect all key changes in the actual vnode store. A separate store can have features, such as being sorted by segment ID, that make that store easier to scan for rebuilds of the tree, hence avoiding the three main costs of scanning over the primary database:
- the impact on the page cache as all keys and values have to be read from disk, including not-recently used values;
@ -88,13 +88,17 @@ The third cost can be addressed by the fold output being an incrementally updata
The [testing of traditional Riak AAE](https://github.com/martinsumner/leveled/blob/master/docs/VOLUME.md#leveled-aae-rebuild-with-journal-check) already undertaken has shown that scanning the database is not necessarily such a big issue in Leveled. So it does seem potentially feasible to scan the store on a regular basis. The testing of Leveldb with the riak_kv_sweeper feature shows that with the improved throttling more regular scanning is also possible here: testing with riak_kv_sweeper managed to achieve 10 x the number of sweeps, with only a 9% drop in throughput.
A hypothesis is proposed that regular scanning of the full store to produce a Tic-Tac tree is certainly feasible in Leveled, but also potentially tolerable in other back-ends. However, <b>frequent</b> scanning is likely to still be impractical. It is therefore suggested that there should be an alternative form of anti-entropy that can be run in addition to scanning, that is lower cost and can be run be frequently in support of whole database scanning. This additional anti-entropy mechanism would focus on the job of verifying that <b>recent</b> changes have been received. So there would be two anti-entropy mechanisms, one which can be run frequently (minutes) to check for the receipt of recent changes, and one that can be run regularly but infrequently (hours/days) to check that full database state is consistent.
A hypothesis is proposed that regular scanning of the full store to produce a Tic-Tac tree is certainly feasible in Leveled, and potentially tolerable in other back-ends. However, <b>frequent</b> scanning is likely to still be impractical. If the database cannot be scanned frequently, a discrepancy between stores caused by a recent failure event will not be detected in a timely manner. It is therefore suggested that there should be an alternative, lower-cost form of anti-entropy that can be run frequently in support of whole database scanning. This additional anti-entropy mechanism would focus on the job of verifying that <b>recent</b> changes have been received.
It is proposed to compare full database state by scanning the actual store, but producing a Tic-Tac tree as the outcome, one that can be merged across partitions through a coverage query to provide an overall view of the database state. This view could be compared with different coverage query offsets within the same cluster, and with different replicated clusters.
So there would be two anti-entropy mechanisms, one which can be run frequently (minutes) to check for the receipt of recent changes, and one that can be run regularly but infrequently (hours/days) to check that full database state is consistent.
It is proposed to compare full database state by scanning the actual store, but producing a Tic-Tac Merkle tree as the outcome, one that can be merged across partitions through a coverage query to provide an overall view of the database state. This view could be compared with different coverage query offsets within the same cluster, and with different replicated clusters.
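The merging of per-partition trees can be illustrated with a minimal Python sketch. It assumes each segment holds an XOR of per-object hashes, so that partition trees can be combined in any order. The segment count, the use of MD5, and all function names are assumptions made for illustration, not taken from Leveled.

```python
import hashlib

SEGMENT_COUNT = 1024  # hypothetical tree size, chosen for this sketch


def segment_and_hash(key, value_hash):
    """Map a key to a segment ID and combine key and value hashes."""
    digest = hashlib.md5(key).digest()
    segment = int.from_bytes(digest[:4], "big") % SEGMENT_COUNT
    key_hash = int.from_bytes(digest[4:8], "big")
    return segment, key_hash ^ value_hash


def new_tree():
    """An empty tree: one zeroed hash per segment."""
    return [0] * SEGMENT_COUNT


def add_entry(tree, key, value_hash):
    """Fold one object into the tree by XOR-ing into its segment."""
    segment, combined = segment_and_hash(key, value_hash)
    tree[segment] ^= combined


def merge_trees(tree_a, tree_b):
    """Merge two partition trees; XOR makes this order-independent."""
    return [a ^ b for a, b in zip(tree_a, tree_b)]


def dirty_segments(tree_a, tree_b):
    """Segment IDs whose hashes differ between two merged views."""
    return [i for i, (a, b) in enumerate(zip(tree_a, tree_b)) if a != b]
```

Under these assumptions, merging the trees from every partition in a coverage plan gives the same result as a single scan over all the objects, and comparing two merged trees yields the segment IDs that need closer inspection.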
To provide a check on recent changes it is proposed to add a temporary index within the store, with an entry for each change that is built from a rounded last modified date and the hash of the value, so that the index can be scanned to form a Tic-Tac tree of recent changes. This assumes that each object has a Last Modified Date that is consistent (for that version) across all points where that particular version is stored, to use as the field name for the index. The term of the index is based on the segment ID (for the tree) and the hash of the value. This allows for a scan to build a tree of changes for a given range of modified dates, as well as a scan for keys and hashes to be returned for a given segment ID and date range.
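As a rough illustration of the index described above, the following Python sketch builds an entry from a rounded last-modified date, a segment ID and a value hash, and then supports the two scans mentioned. The one-hour rounding window, the entry layout and the function names are hypothetical. An in-memory list stands in for the store's index.

```python
from datetime import datetime, timezone
from typing import NamedTuple

ROUND_SECONDS = 3600  # hypothetical rounding window (one hour)


class IndexEntry(NamedTuple):
    field: int       # rounded last-modified time, in epoch seconds
    segment_id: int  # first part of the index term
    value_hash: int  # second part of the index term
    key: bytes


def make_entry(key, last_modified, segment_id, value_hash):
    """Build the index entry for one change."""
    ts = int(last_modified.timestamp())
    return IndexEntry(ts - ts % ROUND_SECONDS, segment_id, value_hash, key)


def tree_for_range(entries, start, end, segment_count):
    """Scan a modified-date range and fold hashes into a tree."""
    tree = [0] * segment_count
    for e in entries:
        if start <= e.field < end:
            tree[e.segment_id] ^= e.value_hash
    return tree


def keys_for_segment(entries, start, end, segment_id):
    """Scan a date range for the keys and hashes in one segment."""
    return [(e.key, e.value_hash) for e in entries
            if start <= e.field < end and e.segment_id == segment_id]
```

The first scan produces a tree for comparison; the second is only needed for the segments where two trees disagree.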
Within the Leveled the index can be made temporary by giving the entry a time-to-live, independent of any object time to live. So once the change is beyond the timescale in which the operator wishes to check for recent changes, it will naturally be removed from the database (through deletion on the next compaction event that hits the entry in the Ledger). Therefore in the long-term, there is no need to maintain additional state outside of the primary database stores, in order to manage anti-entropy.
As this index only covers recent changes, it will be limited in size, and mainly in-memory, and so it can be scanned frequently in a cost-effective manner to both gather trees for comparison, and discover Keys in segments with variations.
Within Leveled the index can be made temporary by giving the entry a time-to-live, independent of any object time-to-live. So once the change is beyond the timescale in which the operator wishes to check for recent changes, it will naturally be removed from the database (through deletion on the next compaction event that hits the entry in the Ledger). Therefore, in the long term, there is no need to maintain additional state outside of the primary database stores in order to manage anti-entropy. This may also be possible using TTL features in leveldb.
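The expiry behaviour can be pictured with a small Python sketch, in which each index entry carries its own expiry time and a compaction pass drops any entry that has expired. The dictionary layout and function names are hypothetical, and this is not a description of how the Leveled Ledger is implemented.

```python
def add_index_entry(entries, term, now, ttl_seconds):
    """Record an index entry with its own expiry time."""
    entries.append({"term": term, "expires_at": now + ttl_seconds})


def compact(entries, now):
    """Keep only the entries whose time-to-live has not elapsed."""
    return [e for e in entries if e["expires_at"] > now]
```

An expired entry remains in place until a compaction pass reaches it, which matches the "next compaction event" wording above.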
Hence overall this should give: