Previously done at Slot Level - but Blocks were still read from disk after the Slot CRC had been checked.
This seems safer. It requires an extra CRC check for every fetch. However, CRC checking smaller binaries during the build process appears to be beneficial to performance.
It is hoped this will be an enabler for turning off compression at Levels 0 and 1 to improve performance (without a compensating issue of reduced CRC performance).
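A minimal sketch of the block-level CRC idea, in Python purely for illustration (it is not the project's own code). The 4-byte big-endian CRC prefix and the function names `pack_block` and `fetch_block` are assumptions made for the example:

```python
import zlib

def pack_block(payload: bytes) -> bytes:
    """Prefix a block with the CRC32 of its payload (hypothetical layout)."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return crc.to_bytes(4, "big") + payload

def fetch_block(stored: bytes):
    """Return the payload if its CRC matches, otherwise None."""
    if len(stored) < 4:
        return None
    expected = int.from_bytes(stored[:4], "big")
    payload = stored[4:]
    if (zlib.crc32(payload) & 0xFFFFFFFF) != expected:
        return None  # corrupted block: treat as a failed fetch
    return payload
```

With the check at block level, each fetch verifies only the bytes it is about to use, rather than trusting a check made earlier over the whole slot.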
Previously the tinybloom was used within the SST file as an extra check to remove false fetches.
However the SST already has a low FPR check in the slot_index. If the new ebloom was used (which is no longer per slot, but per SST), this can be shared with the penciller, and the penciller could then use it and avoid the message pass.
The message pass may be blocked by a 2i query or a slot fetch request for a merge. So this should make performance within the Penciller snappier.
This is as a result of taking sst_timings within a volume test - where there was an average of +100 microsecs for each level that was dropped down. Given the bloom/slot checks were < 20 microsecs - there seems to be some further delay.
The bloom is a binary of > 64 bytes - so passing it around should not require a copy.
Compression can be switched between LZ4 and zlib (native).
The setting to determine if compression should happen on receipt is now a macro definition in leveled_codec.
Note that accelerating segment_list queries will not work for tree sizes smaller than small. How to flag this up?
Should smaller tree sizes just be removed from leveled_tictac?
Initially with basic tests. If the SlotIndex has been cached, we can now use the slot index as it is based on the Segment hash algorithm.
This looks like it should lead to an order of magnitude improvement in querying for keys/clocks by segment ID.
This also required a slight tweak to the penciller keyfolder. It now caches the next answer from the SSTiter, rather than restarting the iterator. When the IMMiter has many more entries than the SSTiter (as the SSTiter is being filtered but not the IMMiter) this could lead to lots of repeated folding.
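The caching change to the keyfolder can be pictured with a small Python sketch (illustrative only, not the project's code). It merges two sorted iterators and holds on to the head of the SST side while the in-memory side is consumed. The name `merge_fold` and the rule that the in-memory entry wins on equal keys are assumptions:

```python
def merge_fold(imm_iter, sst_iter):
    """Merge two sorted (key, value) iterators, caching the SST head."""
    done = object()
    imm_next = next(imm_iter, done)
    sst_next = next(sst_iter, done)  # cached; not re-fetched per IMM entry
    while imm_next is not done or sst_next is not done:
        take_imm = sst_next is done or (
            imm_next is not done and imm_next[0] <= sst_next[0])
        if take_imm:
            if sst_next is not done and imm_next[0] == sst_next[0]:
                # Assumed rule: the in-memory entry supersedes the SST entry
                sst_next = next(sst_iter, done)
            yield imm_next
            imm_next = next(imm_iter, done)
        else:
            yield sst_next
            sst_next = next(sst_iter, done)
```

The SST iterator is advanced only when its cached head is actually emitted or superseded, however many in-memory entries are passed over first.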
More entropy by using the position index with the segment hash - so this would be a better filter to apply.
Also could increase the key count now, as extra hash can be larger.
As an aside - a leveled_iclerk unit test failure appeared - the range was just wrong. Don't know why this started happening.
Switch from magic hash to md5 - to hopefully remove the need for some
of the artificial jumps required to get expected false positive ratios.
Also split the hash into two 16-bit integers. We assume that SegmentID
(from the perspective of AAE merkle/tictac trees) will always be at
least 16 bits. The idea is that hashes should be used in blooms and
indexes such that some advantage can be gained from just knowing the
segmentID - in particular when folding over all the keys in a bucket.
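A rough Python illustration of splitting an md5-derived hash into two 16-bit integers (not the project's code). Which bytes of the digest are used, and the names `segment_hash` and `keys_in_segments`, are assumptions for the example:

```python
import hashlib

def segment_hash(key: bytes):
    """Return (segment_id, extra_hash), each a 16-bit integer."""
    digest = hashlib.md5(key).digest()
    segment_id = int.from_bytes(digest[0:2], "big")  # assumed: first 2 bytes
    extra_hash = int.from_bytes(digest[2:4], "big")  # assumed: next 2 bytes
    return segment_id, extra_hash

def keys_in_segments(keys, wanted_segments):
    """Keep only keys whose segment ID is in the wanted set."""
    return [k for k in keys if segment_hash(k)[0] in wanted_segments]
```

Because the segment ID is one half of the hash, a fold that knows only the segment IDs of interest can already discard keys without needing the full hash.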
Performance testing has been difficult so far - I think due to “cloud”
mysteries.
Change to 5 blocks is intended to make the blocks in lookup slots
fractionally smaller, but more importantly to introduce a middle block
that can be opened in a binary-split style fashion to reduce the number
of blocks that need to be opened for range queries. Worst case for
full slots is 3 blocks now not 4.
Still not clear if yielding is the cause of memory problems, but taking
it away universally has impacted throughput. At the very least we
should continue to yield on high-contention files (those at higher
levels), where the processes are more likely to be quickly terminated
anyway allowing GC to be invoked.
There was complicated and confusing code that achieved nothing for
efficiency when trimming slots. The expensive part (binary_to_term) was
still needed on every block, and it was hard to get code coverage and
make sense of what it was really trying to achieve.
This is now much simpler - and may set us up for potential further
indexing help.
RTrim only worked in the special case of key matching, which would never
occur in a real-world range query. RTrim should really check for key
passing.
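The difference between matching and passing can be shown with a short Python sketch (illustrative only, not the project's code). The name `rtrim` and the inclusive end key are assumptions:

```python
def rtrim(kvs, end_key):
    """Trim a sorted (key, value) list at the first key that passes end_key."""
    kept = []
    for key, value in kvs:
        if key > end_key:
            break  # key has passed the end of the range: stop here
        kept.append((key, value))
    return kept
```

A trim that waits for a key equal to `end_key` never fires when the end of the range falls between two stored keys, which is the usual case for a range query. Comparing with `>` handles both cases.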
Returning empty list should not be possible - unless the query is
outside of the range entirely (and such a query should never go to this
SST).