Discovered a bug with search ranges in leveled_tree - this was uncovered by an intermittently failing 19.3 test.
Test case added and bug fixed. It was due to a failure to use the end_key passed in, which caused issues with particular manifests and full bucket ranges.
Introduce a dedicated module for all the different fold types. Also simplify the list of folders by deprecating those folds that should be achievable by fold_heads/fold_objects type folds but with smarter functions.
Makes sure that the fold functions also have better spec coverage, and are dialyzer checked.
As described in https://github.com/martinsumner/leveled/issues/92
Only the first fix was made.
Just to be safe - archiving means renaming to another file with a different extension. The assumption is that renamed files can be manually reaped if necessary.
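A minimal sketch of archive-by-rename, written in Python purely for illustration (leveled itself is Erlang). The function name `archive_file` and the `.bak` extension are assumptions for this example, not names taken from the codebase.

```python
import os
import tempfile


def archive_file(path, archive_ext=".bak"):
    """Archive a file by renaming it to the same base name with a new extension.

    Nothing is deleted: the renamed file stays on disk so that it can be
    reaped manually later. The ".bak" default is a hypothetical choice.
    """
    base, _old_ext = os.path.splitext(path)
    archived = base + archive_ext
    os.replace(path, archived)  # rename within the same directory
    return archived


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    original = os.path.join(workdir, "journal_1.cdb")
    open(original, "w").close()
    print(archive_file(original))
```

Because the data is only renamed, a mistaken archive can be undone by renaming the file back.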
Idea being that sometimes you may wish to compare a tictac tree between leveled and something that doesn't understand erlang:phash or term_to_binary. So allow the magic_hash to be used instead - and perhaps an extract function that does base64 encoding or something similar.
The "small" tree will serialise to 1.5MB - which seems large. Much smaller trees seem to be more suitable for things like recently modified aae indexes.
This required a switch to change the sync strategy based on a rebar parameter.
However tests could be slow on a MacBook with OTP16 and sync - so timeouts were added in unit tests, and the ct tests' sync_strategy was changed to not sync for OTP16.
Obviously got totally messed up and confused when testing previous
commits.
Multiple tests were failing for a change which got merged in as the
tests were not reflecting the required API.
Comparing the inbuilt tictac_tree fold, to using "proper" abstraction and achieving the same thing through fold_heads.
The fold_heads method is slower (a lot more manipulation required in the fold) - expect it to require > 2 x CPU.
However, this does give the flexibility to change the hash algorithm. This would allow for a fold over a database of AAE trees (where the hash has been pre-computed using sha) to be compared with a fold over a database of leveled backends.
Also can vary whether the fold_heads checks for presence of the object in the Inker. So normally we can get the speed advantage of not checking the Journal for presence, but periodically we can.
Build the AAE tree equally using fold_heads. This is a precursor to running this within Riak.
In part this leans on some of the work done to improve standard Riak AAE with leveled. When rebuilding the standard AAE store only the head is required, and so this process was switched in riak_kv_sweeper to make a fold_heads request if supported by the backend.
The head response is a proxy object, which when loaded into a riak_object will allow for access to object metadata, but will use the passed function if access to object contents is requested.
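The proxy behaviour described above can be sketched as follows. This is an illustrative Python model only: the class name `ProxyObject` and its attributes are hypothetical and do not reflect the riak_object or leveled API.

```python
class ProxyObject:
    """Holds object metadata eagerly and fetches contents only on demand."""

    def __init__(self, metadata, fetch_fun):
        self.metadata = metadata      # available immediately from the head
        self._fetch_fun = fetch_fun   # the passed function that loads contents
        self._contents = None
        self._loaded = False

    @property
    def contents(self):
        # Only the first access pays the cost of the fetch.
        if not self._loaded:
            self._contents = self._fetch_fun()
            self._loaded = True
        return self._contents


if __name__ == "__main__":
    proxy = ProxyObject({"last_modified": 1500000000}, lambda: b"object body")
    print(proxy.metadata)   # no fetch needed
    print(proxy.contents)   # triggers the fetch
```

A fold that only inspects metadata never calls the fetch function, which is where the saving comes from.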
Need to know {Bucket, Key} not just Key if all buckets are being covered
by nrt aae. So shoehorning this in - will also allow for proper use of
FilterFun when filtering by partition.
With basic ct test.
Doesn't currently prove expiry of index. Doesn't prove ability to find
segments.
Assumes that either "all" buckets or a special list of buckets require
indexing this way. Will lead to unexpected results if the same bucket
name is used across different Tags.
The format of the index has been chosen so that hopefully standard index
features can be used (e.g. return_terms).
Just some initial WIP code for this. Will revisit this again after
exploring some ideas as to how to reduce the cost of the
get_keys_by_segment.
The overall idea is that there are trees of recent modifications, with
recent being some rolling time window made up of hourly blocks, and
recency being determined by the last-modified date on the object metadata
- which should be consistent across a cluster.
So if we were at 15:30 we would get the tree for 14:00 - 15:00 and the
tree for 15:00-16:00 from two different queries which cover the same
partitions and then compare.
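As a rough illustration of how the hourly blocks to query might be chosen, the sketch below (Python, for illustration only) returns the current block and the one before it. The function name `hourly_windows` and the `hours` parameter are assumptions for this example.

```python
from datetime import datetime, timedelta


def hourly_windows(now, hours=2):
    """Return (start, end) pairs for the most recent hourly blocks, oldest first.

    The last pair is the block containing `now`; earlier pairs are the
    blocks immediately preceding it.
    """
    current_start = now.replace(minute=0, second=0, microsecond=0)
    windows = []
    for back in range(hours - 1, -1, -1):
        start = current_start - timedelta(hours=back)
        windows.append((start, start + timedelta(hours=1)))
    return windows


if __name__ == "__main__":
    # At 15:30 this yields the 14:00-15:00 and 15:00-16:00 blocks.
    for start, end in hourly_windows(datetime(2017, 7, 1, 15, 30)):
        print(start.strftime("%H:%M"), "-", end.strftime("%H:%M"))
```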
Comparison may find differences, and we know what segment the difference
is in - but how to then find all keys in that segment which have been
modified in the period? Three ways:
- Do it inefficiently and infrequently using a fold_keys and a filter
  (perhaps with SST files having a highest LMD in the metadata so that
  they can be skipped).
- Add a special index, where every entry has a TTL, and the Key is
  {$segment, Segment, Bucket, Key} so that a normal 2i query can be used.
- Align hashing for segments with hashing for penciller lookup so that a
  query over the actual keys can be optimised by skipping chunks of the
  in-memory part, and chunks of the SST file.
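The special-index option can be sketched as a range query over ordered keys. This is an illustrative Python model, not leveled code: the dictionary-based index, the function names and the expiry representation (a timestamp per entry) are all assumptions.

```python
import bisect


def segment_index_key(segment, bucket, key):
    """Build an index key shaped like {$segment, Segment, Bucket, Key}."""
    return ("$segment", segment, bucket, key)


def keys_in_segment(index, segment, now):
    """Return unexpired (bucket, key) pairs for one segment.

    `index` maps index keys to an expiry timestamp (the TTL). Sorting the
    keys stands in for an ordered store, so all entries for a segment are
    contiguous and can be found with a simple range scan.
    """
    ordered = sorted(index)
    start = bisect.bisect_left(ordered, ("$segment", segment))
    end = bisect.bisect_left(ordered, ("$segment", segment + 1))
    return [
        (idx_key[2], idx_key[3])
        for idx_key in ordered[start:end]
        if index[idx_key] > now  # skip entries whose TTL has passed
    ]


if __name__ == "__main__":
    index = {
        segment_index_key(7, "b1", "k1"): 200,
        segment_index_key(7, "b1", "k2"): 50,   # already expired at now=100
        segment_index_key(8, "b1", "k3"): 200,
    }
    print(keys_in_segment(index, 7, now=100))
```

Putting the segment first in the key is what makes the lookup a single contiguous range.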
Allow tictac tree sizes to be flexible.
Tested lots of different sizes. Having both level 1 and level 2 the
same size seemed to be consistently quicker than trying to make either
of the levels relatively wider.
There's an 8% performance improvement if the SegmentCount is reduced by
a quarter.