Returning bucket when bucket is all

Need to know {Bucket, Key} not just Key if all buckets are being covered
by nrt aae.  So shoehorning this in - will also allow for proper use of
FilterFun when filtering by partition.
This commit is contained in:
martinsumner 2017-07-03 18:03:13 +01:00
parent b143ea1c08
commit 97fdd36d53
5 changed files with 70 additions and 19 deletions

View file

@ -3,14 +3,18 @@
Leveled is primarily designed to be a backend for Riak. Riak has a number of anti-entropy mechanisms for comparing database state within and across clusters, as part of exploring the potential for improvements in these anti-entropy mechanisms - some features have been added directly to Leveled. The purpose of these features is to:
- Allow for the database state within in a Leveled store or stores to be compared with an other store or stores which should have the same data;
- Allow for quicker checking that recent changes to one store or stores have also been received by another store or stores that should be receiving the same changes.
The aim is to use these backend capabilities to allow for a Riak anti-entropy mechanism with the following features:
- Comparison can be made between clusters with different ring-sizes - comparison is not coupled to partitioning.
- Comparison can use a consistent approach to compare state within and between clusters.
- Comparison does not rely on duplication of database state to a separate store, with further anti-entropy required to manage state variance between the actual and anti-entropy stores.
- Comparison of state can be abstracted from Riak specific implementation so that mechanisms to compare between riak clusters can be re-used to compare between a Riak cluster and another database store. Coordination with another data store (e.g. Solr) can be controlled by the Riak user not the Riak developer.
- Comparison of state can be abstracted from Riak specific implementation so that mechanisms to compare between Riak clusters can be re-used to compare between a Riak cluster and another database store. Coordination with another data store (e.g. Solr) can be controlled by the Riak user not the Riak developer.
## Merkle Trees
@ -18,15 +22,43 @@ Riak has historically used [Merkle trees](https://en.wikipedia.org/wiki/Merkle_t
> A hash tree is a tree of hashes in which the leaves are hashes of data blocks in, for instance, a file or set of files. Nodes further up in the tree are the hashes of their respective children. For example, in the picture hash 0 is the result of hashing the concatenation of hash 0-0 and hash 0-1. That is, hash 0 = hash( hash 0-0 + hash 0-1 ) where + denotes concatenation.
A side effect of the concatenation decision is that trees cannot be calculated incrementally, when elements are not ordered by segment. To calculate the hash of an individual leaf (or segment), the hashes of all the elements under that leaf must be accumulated first.
A side effect of the concatenation decision is that trees cannot be calculated incrementally, when elements are not ordered by segment. To calculate the hash of an individual leaf (or segment), the hashes of all the elements under that leaf must be accumulated first. In the case of the leaf segments in Riak, the leaf segments are made up of a hash of the concatenation of {Key, Hash} pairs under that leaf:
``hash([{K1, H1}, {K2, H2} .. {Kn, Hn}])``
This requires all of the keys and hashes need to be pulled into memory to build the hashtree - unless the tree is being built segment by segment. The Riak hashtree data store is therefore ordered by segment so that it can be incrementally built. The segments which have had key changes are tracked, and at exchange time all "dirty segments" are re-scanned in the store segment by segment, so that the hashtree can be rebuilt.
## Tic-Tac Trees
Anti-entropy in leveled is supported using the leveled_tictac module. This module uses a less secure form of merkle trees that don't prevent information from leaking out, but allow for the trees to be built incrementally, and trees built incrementally to be merged. These trees we're calling Tic-Tac Trees after the [Tic-Tac language](https://en.wikipedia.org/wiki/Tic-tac) which has been historically used on racecourses to communicate the state of the market between participants; although the more widespread use of mobile communications means that the use of Tic-Tac is petering out, and rather like Basho employees, there are now only three Tic-Tac practitioners left.
Anti-entropy in leveled is supported using the leveled_tictac module. This module uses a less secure form of merkle trees that don't prevent information from leaking out, or make the tree tamper-proof, but allow for the trees to be built incrementally, and trees built incrementally to be merged. These trees we're calling Tic-Tac Trees after the [Tic-Tac language](https://en.wikipedia.org/wiki/Tic-tac) which has been historically used on racecourses to communicate the state of the market between participants; although the more widespread use of mobile communications means that the use of Tic-Tac is petering out, and rather like Basho employees, there are now only three Tic-Tac practitioners left.
The change from Merkle trees to Tic-Tac trees is simply to no longer use a cryptographically strong hashing algorithm, and now combine hashes through XORing rather than concatenation. So a segment leaf is calculated from:
``hash(K1, H1) XOR hash(K2, H2) XOR ... hash(Kn, Hn)``
The Keys and hashes can now be combined in any order with any grouping.
This enables two things:
- The tree can be built incrementally when scanning across a store not in segment order (i.e. scanning across a store in key order) without needing to hold an state in memory beyond the size of the tree.
- Two trees from stores with non-overlapping key ranges can be merged to reflect the combined state of that store i.e. the trees for each store can be built independently and in parallel and the subsequently merged without needing to build an interim view of the combined state.
It is assumed that the trees will only be transferred securely between trusted actors already permitted to view, store and transfer the real data: so the loss of cryptographic strength of the tree is irrelevant to the overall security of the system.
## Recent and Whole
Anti-entropy in Riak is a dual-track process:
- there is a need to efficiently and rapidly provide an update on store state that represents recent additions;
- there is a need to ensure that the anti-entropy view of state represents the state of the whole database.
Within the current Riak AAE implementation, recent changes are supported by having a separate anti-entropy store organised by segments so that the Merkle tree can be updated incrementally to reflect recent changes. The Merkle tree produced following these changes should then represent the whole state of the database.
However as the view of the whole state is maintained in a different store to that holding the actual data: there is an entropy problem between the actual store and the AAE store e.g. data could be lost from the real store, and go undetected as it is not lost from the AAE store. So periodically the AAE store is rebuilt by scanning the whole of the real store. This rebuild can be an expensive process, and the cost is commonly controlled through performing this task infrequently, with changes pending in develop to try and manage the scheduling and throttling of this process.
The change from Merkle trees to Tic-Tac trees is simply to no longer use a cryptographically strong hashing algorithm, and now combine hashes through XORing rather than concatenation - to enable merging and incremental builds.
## Divide and Conquer
.... to be completed

View file

@ -1333,8 +1333,6 @@ accumulate_objects(FoldObjectsFun, InkerClone, Tag, DeferredFetch) ->
AccFun.
check_presence(Key, Value, InkerClone) ->
{LedgerKey, SQN} = leveled_codec:strip_to_keyseqonly({Key, Value}),
case leveled_inker:ink_keycheck(InkerClone, LedgerKey, SQN) of

View file

@ -71,7 +71,7 @@
-define(MAGIC, 53). % riak_kv -> riak_object
-define(LMD_FORMAT, "~4..0w~2..0w~2..0w~2..0w~2..0w").
-define(NRT_IDX, "$aae.").
-define(ALL_BUCKETS, list_to_binary("$all")).
-define(ALL_BUCKETS, <<"$all">>).
-type recent_aae() :: #recent_aae{}.
@ -181,9 +181,10 @@ is_active(Key, Value, Now) ->
false
end.
from_ledgerkey({Tag, Bucket, {_IdxField, IdxValue}, Key})
when Tag == ?IDX_TAG ->
{Bucket, Key, IdxValue};
from_ledgerkey({?IDX_TAG, ?ALL_BUCKETS, {_IdxFld, IdxVal}, {Bucket, Key}}) ->
{Bucket, Key, IdxVal};
from_ledgerkey({?IDX_TAG, Bucket, {_IdxFld, IdxVal}, Key}) ->
{Bucket, Key, IdxVal};
from_ledgerkey({_Tag, Bucket, Key, null}) ->
{Bucket, Key}.
@ -393,8 +394,22 @@ gen_indexspec(Bucket, Key, IdxOp, IdxField, IdxTerm, SQN, TTL) ->
%% TODO: timestamps for delayed reaping
tomb
end,
{to_ledgerkey(Bucket, Key, ?IDX_TAG, IdxField, IdxTerm),
{SQN, Status, no_lookup, null}}.
case Bucket of
{all, RealBucket} ->
{to_ledgerkey(?ALL_BUCKETS,
{RealBucket, Key},
?IDX_TAG,
IdxField,
IdxTerm),
{SQN, Status, no_lookup, null}};
_ ->
{to_ledgerkey(Bucket,
Key,
?IDX_TAG,
IdxField,
IdxTerm),
{SQN, Status, no_lookup, null}}
end.
-spec aae_indexspecs(false|recent_aae(),
any(), any(),
@ -418,7 +433,7 @@ aae_indexspecs(AAE, Bucket, Key, SQN, H, LastMods) ->
Bucket0 =
case AAE#recent_aae.buckets of
all ->
?ALL_BUCKETS;
{all, Bucket};
ListB ->
case lists:member(Bucket, ListB) of
true ->
@ -545,7 +560,7 @@ get_size(PK, Value) ->
-spec get_keyandobjhash(tuple(), tuple()) -> tuple().
%% @doc
%% Return a tucple of {Bucket, Key, Hash} where hash is a has of the object
%% Return a tucple of {Bucket, Key, Hash} where hash is a hash of the object
%% not the key (for example with Riak tagged objects this will be a hash of
%% the sorted vclock)
get_keyandobjhash(LK, Value) ->

View file

@ -43,6 +43,7 @@
find_journals/1,
wait_for_compaction/1,
foldkeysfun/3,
foldkeysfun_returnbucket/3,
sync_strategy/0]).
-define(RETURN_TERMS, {true, undefined}).
@ -484,6 +485,11 @@ get_randomdate() ->
foldkeysfun(_Bucket, Item, Acc) -> Acc ++ [Item].
foldkeysfun_returnbucket(Bucket, {Term, Key}, Acc) ->
Acc ++ [{Term, {Bucket, Key}}];
foldkeysfun_returnbucket(Bucket, Key, Acc) ->
Acc ++ [{Bucket, Key}].
check_indexed_objects(Book, B, KSpecL, V) ->
% Check all objects match, return what should be the results of an all
% index query

View file

@ -106,11 +106,11 @@ many_put_compare(_Config) ->
FoldKeysFun =
fun(SegListToFind) ->
fun(_B, K, Acc) ->
fun(B, K, Acc) ->
Seg = leveled_tictac:get_segment(K, SegmentCount),
case lists:member(Seg, SegListToFind) of
true ->
[K|Acc];
[{B, K}|Acc];
false ->
Acc
end
@ -586,7 +586,7 @@ recent_aae_allaae(_Config) ->
IdxQ1 =
{index_query,
<<"$all">>,
{fun testutil:foldkeysfun/3, []},
{fun testutil:foldkeysfun_returnbucket/3, []},
{IdxLMD,
list_to_binary(TermPrefix ++ "."),
list_to_binary(TermPrefix ++ "|")},
@ -619,7 +619,7 @@ recent_aae_allaae(_Config) ->
true = length(DeltaX) == 0, % This hasn't seen any extra changes
true = length(DeltaY) == 1, % This has seen an extra change
[{_, K1}] = DeltaY,
[{_, {B1, K1}}] = DeltaY,
ok = leveled_bookie:book_close(Book2A),
ok = leveled_bookie:book_close(Book2B),