Use 4 keys in the bloom (which is closer to the optimal size). This should halve the fpr - as we can now use the large ExtraHash rather than being constrained by the SegmentHash here.
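As a rough illustration of the expected effect (a sketch, not leveled's code - the bits-per-key ratio below is an assumption), the standard bloom false-positive approximation can be compared at 2 and 4 hashes:

    %% Sketch only: the standard bloom fpr approximation (1 - e^(-K*N/M))^K,
    %% where K = number of hashes, N = keys, M = bits in the filter.
    -module(bloom_fpr_sketch).
    -export([fpr/3]).

    fpr(K, N, M) ->
        math:pow(1 - math:exp(-K * N / M), K).

    %% With an assumed 8 bits per key:
    %% bloom_fpr_sketch:fpr(2, 1000, 8000) ~ 0.049
    %% bloom_fpr_sketch:fpr(4, 1000, 8000) ~ 0.024 - roughly half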
Switch from magic hash to md5 - to hopefully remove the need for some
of the artificial jumps required to get expected false positive ratios.
Also split the hash into two 16-bit integers. We assume that SegmentID
(from the perspective of AAE merkle/tictac trees) will always be at
least 16 bits. The idea is that hashes should be used in blooms and
indexes such that some advantage can be gained from just knowing the
SegmentID - in particular when folding over all the keys in a bucket.
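A minimal sketch of the shape this takes (the exact bit split and the names are illustrative, not necessarily the leveled implementation):

    %% Sketch: md5 the key and expose the first 16 bits as a SegmentHash
    %% (lining up with an AAE SegmentID of at least 16 bits), and the next
    %% 16 bits as an ExtraHash for use in blooms and indexes.
    -module(segment_hash_sketch).
    -export([segment_hash/1]).

    segment_hash(KeyBin) when is_binary(KeyBin) ->
        <<SegmentHash:16/integer, ExtraHash:16/integer, _Rest/binary>> =
            crypto:hash(md5, KeyBin),
        {SegmentHash, ExtraHash}.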
Performance testing has been difficult so far - I think due to “cloud”
mysteries.
As described in https://github.com/martinsumner/leveled/issues/92
Only the first fix was made.
Just to be safe - archiving means renaming to another file with a different extension. The assumption is that renamed files can be manually reaped if necessary.
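For illustration only - the extension and function name below are assumptions, not what leveled actually uses:

    %% Sketch: archive by renaming to a different extension rather than
    %% deleting, so the file can be manually reaped later if needed.
    -module(archive_sketch).
    -export([archive_file/1]).

    archive_file(Filename) ->
        ArchiveName = filename:rootname(Filename) ++ ".bak",
        file:rename(Filename, ArchiveName).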
Because there's no sensible way of using it if objects are mutable - you still end up with the same false positives in the tictactree.
Didn't fully roll back the change, as spec and docs were added which could be useful going forward.
This is an interim stage towards enhancing the proxy object so that it contains more helper information (other than size).
The aim is to be able to run more efficient fold_heads queries that might filter on LMD range (so as not to have to co-ordinate the running of comparative queries). For example, if producing a tictactree to compare between two different offsets, a max LMD could be passed in so that changes beyond the time the first query was requested can be ignored.
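A hedged sketch of how a max LMD might be applied in a fold_heads accumulator (the accessor is passed in because the real proxy object format is not assumed here):

    %% Sketch: build a fold fun that ignores any head whose last modified
    %% date falls after MaxLMD, i.e. changes made after the first of the
    %% comparative queries was requested. LMDFun extracts the LMD from the
    %% proxy object - its real shape in leveled is not assumed here.
    -module(lmd_filter_sketch).
    -export([make_fold_fun/2]).

    make_fold_fun(MaxLMD, LMDFun) ->
        fun(_Bucket, _Key, ProxyObj, Acc) ->
            case LMDFun(ProxyObj) =< MaxLMD of
                true  -> Acc + 1;  % in range - include in the comparison
                false -> Acc       % changed after the query was requested
            end
        end.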
The idea is that sometimes you may wish to compare a tictac tree between leveled and something that doesn't understand erlang:phash or term_to_binary. So allow the magic_hash to be used instead - and perhaps an extract function that does base64 encoding or something similar.
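A sketch of the shape of that option (names are illustrative, and the base64 extract is just the one possibility mentioned above):

    %% Sketch: let the caller choose the extract and hash functions used
    %% when building the tictac tree, so a non-Erlang store can reproduce it.
    -module(tictac_extract_sketch).
    -export([tree_hash/2]).

    tree_hash(Key, native) ->
        %% the existing default: Erlang-specific serialisation and hash
        erlang:phash2(term_to_binary(Key));
    tree_hash(Key, {HashFun, ExtractFun}) ->
        %% e.g. ExtractFun = fun(K) -> base64:encode(term_to_binary(K)) end,
        %% and HashFun a DJ Bernstein-style magic hash over that binary
        HashFun(ExtractFun(Key)).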
Need to know {Bucket, Key} not just Key if all buckets are being covered
by nrt aae. So shoehorning this in - this will also allow for proper use
of FilterFun when filtering by partition.
With basic ct test.
Doesn't currently prove expiry of index. Doesn't prove ability to find
segments.
Assumes that either "all" buckets or a special list of buckets require
indexing this way. Will lead to unexpected results if the same bucket
name is used across different Tags.
The format of the index has been chosen so that hopefully standard index
features can be used (e.g. return_terms).
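Purely for illustration - the index name, term layout and date encoding below are all assumptions, not the format leveled settled on:

    %% Sketch: an AAE index entry whose term leads with a last-modified date
    %% (so date ranges can be selected and old entries expired) and carries
    %% the SegmentID, while the entry itself is keyed by {Bucket, Key} so
    %% return_terms-style queries hand back everything needed for comparison.
    -module(aae_index_sketch).
    -export([index_entry/4]).

    index_entry(Bucket, Key, SegmentID, DateBin) when is_binary(DateBin) ->
        Term = <<DateBin/binary, ".",
                 (integer_to_binary(SegmentID))/binary>>,
        {<<"$aae">>, Term, {Bucket, Key}}.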
Remove the ets index in pmem and use a binary index instead. This may
be slower, but avoids the bulk upload to ets, and means that matches
know the position (so only skiplists with a match need be tried).
Also stops the discrepancy between snapshots and non-snapshots - as
previously the snapshots were always slowed by not having access to the
ETS table.
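A sketch of the lookup idea (the entry widths are assumptions; the real leveled index layout may differ):

    %% Sketch: the penciller memory keeps a flat binary of
    %% {HashSlot, CachePosition} pairs rather than an ETS index. A lookup
    %% walks the binary and returns only the cache positions whose slot
    %% matches, so only the skiplists at those positions need to be checked.
    -module(pmem_index_sketch).
    -export([add_entry/3, matching_positions/2]).

    add_entry(IndexBin, HashSlot, Position) ->
        <<IndexBin/binary, HashSlot:24/integer, Position:8/integer>>.

    matching_positions(IndexBin, HashSlot) ->
        matching_positions(IndexBin, HashSlot, []).

    matching_positions(<<>>, _HashSlot, Acc) ->
        lists:reverse(Acc);
    matching_positions(<<Slot:24/integer, Pos:8/integer, Rest/binary>>,
                       HashSlot, Acc) when Slot =:= HashSlot ->
        matching_positions(Rest, HashSlot, [Pos|Acc]);
    matching_positions(<<_Slot:24/integer, _Pos:8/integer, Rest/binary>>,
                       HashSlot, Acc) ->
        matching_positions(Rest, HashSlot, Acc).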
The full riak metadata had been stripped from the Ledger update for
performance reasons. However, the full metadata is required in order to
save a GET before a PUT. Therefore we want to do isolated testing on
this change to establish whether the extra cost is worth that saving.
It is expensive on the CPU - but it leads to a 4x increase in cache
coverage.
Try to make some small micro-gains in list handling in create_block.
Move to using the DJ Bernstein Magic Hash consistently, and try to
make sure we only hash once for each operation (as the hash is more
expensive than phash2).
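For reference, a DJ Bernstein-style hash has the shape below (a sketch of the general djb2-xor form, not necessarily byte-for-byte what leveled_codec implements); the intent is then to carry the resulting hash alongside the key rather than rehashing at each stage:

    %% Sketch: djb2 (xor variant) over the binary form of the key, kept to
    %% 32 bits. Compute once per operation and pass around with the key.
    -module(magic_hash_sketch).
    -export([magic_hash/1]).

    magic_hash(Key) ->
        hash_bytes(term_to_binary(Key), 5381) band 16#FFFFFFFF.

    hash_bytes(<<>>, H) ->
        H;
    hash_bytes(<<B:8/integer, Rest/binary>>, H) ->
        hash_bytes(Rest, ((H bsl 5) + H) bxor B).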
The improved lookup time for missing keys should allow for the L0 index
to be removed, and hence speed up the completion time for push_mem
operations.
It is expected there will be a second stage of creating a tinybloom as
part of the SFT creation process, and then adding that tinybloom to the
manifest. This will then reduce the message passing required for a GET
not in the cache or higher levels.
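A sketch of the intended flow once the bloom sits in the manifest (the record fields and the check/fetch functions are placeholders, not leveled's API):

    %% Sketch: a GET that misses the cache can test the per-file tinybloom
    %% held in the manifest before messaging the SFT file process at all.
    -module(manifest_bloom_sketch).
    -export([maybe_fetch/5]).

    -record(manifest_entry, {filename, owner, bloom}).

    maybe_fetch(Key, Hash, #manifest_entry{bloom = Bloom, owner = FilePid},
                CheckFun, FetchFun) ->
        case CheckFun(Hash, Bloom) of
            false -> not_present;            % no message passed to the file
            true  -> FetchFun(FilePid, Key)  % may still be a false positive
        end.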
Change the extraction of Riak metadata.
In Riak-based volume tests the writing of SFT files is tanking. Could
this be the "extra" metadata? i.e. there are only current plans to look
at the vclock, and the sibling count is free to fetch. If we just get
these two items, will it take less CPU to extract the metadata, and
will the reduced weight also reduce the downstream impact?
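A hedged sketch of pulling just those two items out (this assumes the v1 riak_object binary layout; the pattern is illustrative, not verified against the Riak source):

    %% Sketch: assuming a riak_object v1 binary of the form
    %% <<Magic:8, Vers:8, VclockLen:32, Vclock:VclockLen/binary,
    %%   SibCount:32, Sibs/binary>>,
    %% take only the vclock binary and the sibling count, without touching
    %% the sibling data itself.
    -module(riak_md_sketch).
    -export([vclock_and_sibcount/1]).

    vclock_and_sibcount(<<_Magic:8/integer, _Vers:8/integer,
                          VclockLen:32/integer, VclockBin:VclockLen/binary,
                          SibCount:32/integer, _SibsBin/binary>>) ->
        {VclockBin, SibCount}.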