* Query, don't copy
Queries the manifest rather than copying the whole manifest when taking
a snapshot of a Penciller to run a query.
Change the logging of fold setup in the Bookie to record the actual
snapshot time (rather than the uninteresting and fast-returning call to
the function which will request the snapshot).
A little tidy to avoid duplicating the ?MAX_LEVELS macro.
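As a minimal sketch of the manifest change (record and function names
are illustrative, not the actual leveled internals): select only the
manifest entries whose key range overlaps the query, and give the
snapshot just that sublist rather than the whole manifest.

    -module(manifest_query_sketch).
    -export([entries_for_query/3]).

    -record(entry, {start_key, end_key, owner}).

    %% Return only those manifest entries whose key range overlaps the
    %% query range - the snapshot copies this sublist, not the whole
    %% manifest.
    entries_for_query(StartKey, EndKey, Manifest) ->
        lists:filter(
            fun(#entry{start_key = SK, end_key = EK}) ->
                not (EK < StartKey orelse SK > EndKey)
            end,
            Manifest).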
* Clarify log is of snapshot time not fold time
* Updates after review
Potentially reduce the overheads of scoring each file on every run.
The change also alters the default thresholds for compaction to favour longer runs (which will tend towards greater storage efficiency).
During EQC testing it was found that snapshots are still usable even
if the bookie process crashes. This change has snapshots monitor the
bookie and close when the bookie process dies.
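A sketch of the monitoring pattern, as a gen_server fragment with
illustrative names (not the exact leveled code):

    -record(state, {bookie_monitor :: reference()}).

    init([BookiePid]) ->
        %% Monitor the bookie so the snapshot learns of its death
        MonRef = erlang:monitor(process, BookiePid),
        {ok, #state{bookie_monitor = MonRef}}.

    handle_info({'DOWN', MonRef, process, _Pid, _Reason},
                State = #state{bookie_monitor = MonRef}) ->
        %% The bookie has died - close the snapshot rather than leave
        %% it running against a dead store
        {stop, normal, State}.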
Initial commit to add head_only mode to leveled. This allows leveled to
receive batches of object changes, where those objects exist only in the
Penciller's Ledger (once they have been persisted within the Ledger).
The aim is to significantly reduce the cost of compaction. Also, the
objects are not directly accessible (they can only be accessed through
folds). Again this makes life easier during merging in the LSM trees (as
no bloom filters have to be created).
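A hypothetical usage sketch - the function name and the shape of the
object specs here are assumptions for illustration, not a confirmed
API. The point is that changes arrive in batches, and each change
describes a Ledger-only object rather than a full Journal object:

    %% Each spec is a change to apply, not a full object PUT
    %% (the {Op, Bucket, Key, SubKey, Value} shape is assumed)
    ObjectSpecs =
        [{add, <<"bucket">>, <<"key1">>, <<"subkey1">>, <<"value1">>},
         {remove, <<"bucket">>, <<"key2">>, <<"subkey2">>, null}],
    leveled_bookie:book_mput(Bookie, ObjectSpecs).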
Previously the tinybloom was used within the SST file as an extra check
to remove false fetches.
However the SST already has a low FPR check in the slot_index. If the
new bloom is used (which is no longer per slot, but per SST), it can be
shared with the Penciller, and then the Penciller can use it and avoid
the message pass.
The message pass may be blocked by a 2i query or a slot fetch request
for a merge - so this should make performance within the Penciller
snappier.
This is as a result of taking sst_timings within a volume test - where
there was an average of +100 microseconds for each level that was
dropped down. Given the bloom/slot checks were < 20 microseconds, there
seems to be some further delay.
The bloom is a binary of > 64 bytes - so passing it around should not
require a copy.
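A sketch of the intended flow (function names are illustrative; the
bloom check and the SST call are passed in as funs to keep the sketch
self-contained). Because Erlang binaries over 64 bytes are
reference-counted rather than copied onto each process heap, the
Penciller and the SST process can both hold the bloom cheaply:

    -module(penciller_bloom_sketch).
    -export([maybe_fetch/4]).

    %% BloomFun stands in for the per-SST bloom check; FetchFun for
    %% the gen_server call into the SST process.
    maybe_fetch(Key, Hash, BloomFun, FetchFun) ->
        case BloomFun(Hash) of
            false ->
                not_present;          % definite miss - no message pass
            true ->
                FetchFun(Key, Hash)   % possible hit - pay for the call
        end.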
Compression can be switched between LZ4 and zlib (native).
The setting to determine if compression should happen on receipt is now a macro definition in leveled_codec.
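A sketch of the kind of macro definitions involved (the names are
illustrative of the approach, not copied from leveled_codec):

    -define(COMPRESSION_METHOD, lz4).    %% lz4 | native (zlib)
    -define(COMPRESS_ON_RECEIPT, true).  %% compress at PUT, not only
                                         %% at compaction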
With basic ct test.
Doesn't currently prove expiry of index. Doesn't prove ability to find
segments.
Assumes that either "all" buckets or a special list of buckets require
indexing this way. Will lead to unexpected results if the same bucket
name is used across different Tags.
The format of the index has been chosen so that hopefully standard
index features can be used (e.g. return_terms).
When running a load of mainly 2i queries, there is a huge cost in the
previous snapshot code. The time taken to create a clone of the
Penciller (duplicating all the LoopState) varied between 1 and 200ms
depending on the size of the LoopState.
For 2i queries, most of that LoopState was then being thrown away after
running the query against the levelzero_cache. This was taking < 1ms on
average. It would be better to avoid the o(100)ms of CPU burning and
block for o(1)ms - so the order of events has been changed to filter
first, so that only the small part of the LoopState actually required
is copied to the clone.
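A minimal sketch of the reordering (record and function names are
illustrative): the filtering now happens while the Penciller is
blocked, and only the filtered result is handed to the clone.

    -module(snapshot_filter_sketch).
    -export([snapshot_for_query/4]).

    -record(query_snapshot, {l0_list, manifest}).

    snapshot_for_query(L0Cache, Manifest, StartKey, EndKey) ->
        %% o(1)ms of blocking work: keep only keys in the query
        %% range ...
        L0List =
            [KV || {K, _V} = KV <- L0Cache, K >= StartKey, K =< EndKey],
        %% ... rather than o(100)ms duplicating the whole LoopState
        #query_snapshot{l0_list = L0List, manifest = Manifest}.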
Under the assumption that it generates less GC noise (based on a
micro-benchmark in leveled_tree eunit testing).
Note: to confirm this, the test order needed to be swapped around, and
this showed fewer collections in each position for skpl - and a < 10%
performance hit.
Going to abandon this branch for now. The change is becoming
excessively time-consuming, and it is not clear that a smaller change
might not achieve more of the objectives.
All this is broken - but perhaps it could get picked up another day.
This is desirable to add back in going forward, but wasn't implemented
in a safe or clear way.
The way the bloom was or was not on the LoopState was clumsy, and it got
persisted in multiple places without a CRC check.
The intention is to implement it back in, whereby it is requested
on-demand by the Penciller, and then the SFT worker lifts it off disk
and CRC checks it. So it is never on the SFT LoopState. Also it will be
easier to control the logic over which levels have the bloom in the
Penciller.
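A sketch of the intended on-demand read (function name illustrative):
persist the bloom with a CRC, and have the SFT worker read and verify
it only when the Penciller asks, so it never sits on the SFT LoopState.

    %% Assumes the bloom was written as <<CRC:32/integer, Bloom/binary>>
    read_bloom(FileHandle, Position, Length) ->
        {ok, <<CRC:32/integer, BloomBin/binary>>} =
            file:pread(FileHandle, Position, Length),
        case erlang:crc32(BloomBin) of
            CRC -> {ok, BloomBin};
            _ -> {error, crc_wonky}  %% corrupt - treat as no bloom
        end.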
This allows for deleted journals to be retained for a period (the
waste_retention_period). The idea being that a backup strategy can
ensure that all journals are backed up, even ones created and removed
from within a backup period - so that any restore point is possible.
This is also a precursor to removing some of the PromptDelete
complexity from the Inker Clerk - all compactions can prompt deletion as
deletion is now deferred.
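A sketch of how this might be configured (the option name follows the
text above; the exact option shape and units are an assumption):

    StartOpts = [{root_path, "/var/db/leveled"},
                 {waste_retention_period, 86400}],  %% seconds to retain
    {ok, Bookie} = leveled_bookie:book_start(StartOpts).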
There were issues with how the Penciller behaves under heavy write
pressure - most particularly where there are a large number of keys per
update (i.e. 2i-heavy objects). Most immediately, the attempt to check
whether the L0 file was ready slowed down the process of producing the
L0 file - so back-pressure created more back-pressure.
Going forward the intention is to alter this more significantly, as the
work queue can also build up unsustainably. There needs to be some
pausing prompted by the bookie on 'returned', and the use of 'returned'
when the work queue exceeds a threshold.
Changes the startup options to a proplist to make it easier to get
environment variables as defaults.
Tried application:start - and was completely baffled as to how to get
this to work.
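A sketch of the pattern this enables (helper name illustrative): check
the supplied proplist first, fall back to the application environment,
then to a hard-coded default.

    get_opt(Key, Opts, Default) ->
        case proplists:get_value(Key, Opts) of
            undefined ->
                %% e.g. falls back to {leveled, Key} from sys.config
                application:get_env(leveled, Key, Default);
            Value ->
                Value
        end.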
There was some unpredictable performance in tests, which was related to
the amount of time it took the sft gen_server to accept a cast which
passed the levelzero_cache.
The response time looked to be broadly proportional to the size of the
cache - so it appeared to be an issue with passing the large object to
the process queue.
To avoid this, the penciller now instructs the SFT gen_server to
callback to the server for each tree in the cache in turn, as it is
building the list from the cache. Each of these requests should be
relatively short, and the processing in-between should space out the
requests so the Penciller is not blocked from answering queries when
prompting a L0 write.
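A sketch of the callback pattern (names illustrative): one short call
per tree, rather than one cast carrying the whole cache.

    %% Run by the SFT worker: pull the cache one slot at a time, so
    %% each message is small and the Penciller can answer queries in
    %% between the calls.
    fetch_levelzero(Penciller, CacheSize) ->
        lists:map(
            fun(Slot) ->
                gen_server:call(Penciller,
                                {fetch_levelzero, Slot},
                                infinity)
            end,
            lists:seq(1, CacheSize)).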
Further work on variable reload strategies with some unit test coverage.
Also work on potentially supporting no_hash on PUT to journal files for
objects which will never be directly fetched.
There was a test that failed to close down a bookie and that caused
some issues. The issues are doubly resolved: the close down was tidied,
as well as the forgotten close being added back in.
There is some general tidying around in anticipation of TTL support.