Cmpaction is overly aggressive. It is a lot of work to compact a run of
files for just 20% reduction in disk space, when disk space for the
Journal (i.e. low IOPS disk space should be relatively inexpensive).
Require at least a 40% reduction for a compaction job.
When running a load of mainly 2i queries, there is a huge cost in the
previous snapshot code. The time taken to create a clone of the
Penciller (duplicating all the LoopState) varied between 1 and 200ms
depedning on the size of the LoopState.
For 2i queries, most of that LoopState was then being thrown away after
running the query against the levelzero_cache. This was taking < 1ms on
average. It would be better to avoid the o(100)ms of CPU burning and
block for o(1)ms - so th eorder of events have been changed ot filter
first so only the small part of the LoopState actually required is
copied to the clone.
Move to using the DJ Bernstein Magic Hash consistently, and trying to
make sure we only hash once for each operation (as the hash is more
expensive than phash2).
The improved lookup time for missing keys should allow for the L0 index
to be removed, and hence speed up the completion time for push_mem
operations.
It is expected there will be a second stage of creating a tinybloom as
part of the SFT creation process, and then adding that tinybloom to the
manifest. This will then reduce the message passing required for a GET
not in the cache or higher levels
This allows for deleted journals to be retained for a period (the
waste_retnetion_period). The idea being that a backup strategy can
ensure that all journals are backed up, even ones created and removed
from within a backup period - so that any restore pont is possible.
This is also a pre-cursor to removing some of the PromptDelete
complexity from the Inker Clerk - all compactions can prompt deletion as
deletion is now deferred.
When there is write pressure on the penciller and it returns to the
bookie, the bookie will now punish the next PUT (and itself) with a
pause. The longer the back-pressure state has been in place, the more
frequent the pauses
Added a test of journal compaction with a registered snapshot and it
showed that the deleting of files did not correctly check the list of
registerd snapshots. Corrected.
Busted the hashtable in a Journal file, and demonstrated it can be fixed
by changing the extension name (no need to recover from backup if only
the hashtable is bust)
Plugged the ne wpencille rmemory into the Penciller, and took advantage
of the increased speed to simplify the callbacks involved.
The outcome is much simpler code
Removed o(100) lines of code by refactoring the Penciller to no longer
use ETS tables. The code is less confusing, and probably not an awful
lot slower.
The CDB file management server has distinct states, and was growing case
logic to prevent certain messages from being handled in ceratin states,
and to handle different messages differently. So this has now been
converted to a gen_fsm.
As part of resolving this, the space_clear_ondelete test has been
completed, and completing this revealed that the Penciller could not
cope with a change which emptied the ledger. So a series of changes has
been handled to allow it to smoothly progress to an empty manifest.
The no_hash option in CDB files became too hard to manage, in particular
the need to scan the whole file to find the last_key rather than cheat
and use the index. It has been removed for now.
The writing to the journal during journal compaction has now been
enhanced by a mput option on the CDB file write - so it can write each
batch as one pwrite operation.
Further work on variable reload srategies wiht some unit test coverage.
Also work on potentially supporting no_hash on PUT to journal files for
objects which will never be directly fetched.
The current mechanism of re-loading data from the Journla to the Ledger
from any potential SQN is not safe when combined with Journla
compaction.
This commit doesn't resolve thes eproblems, but starts the groundwork
for resolving by introducing Inker Key Types. These types would
differentiate between objects which are standard Key/Value pairs,
objects which are tombstones for keys, and objects whihc represent Key
Changes only.
The idea is that there will be flexible reload strategies based on
object tags
- retain (retain a key change object when compacting a standard object)
- recalc (allow key changes to be recalculated from objects and ledger
state when loading the Ledger from the journal
- recover (allow for the potential loss of data on loss within the
perisste dpart of the ledger, potentially due to recovery through
externla anti-entropy operations).
Further progress towards the tidying up of basement tombstones in the
Ledger, with support added for key-listing to help with testing (and as
a potentially required feature).
The test is incomplete, but committing at this stage as the last commit
broke some tests (within the test code).
There are some outstanding questions about the handling of tombstones in
the Journal during compaction. There exists a condition whereby values
could return if a recent journal is compacted and tombstones are removed
(as they are no longer present), but older journals have not been
compacted. Now on stop/start - if the Ledger is wiped the removal of
the keys will be forgotten but the original PUTs would still remain.
The safest thing maybe to have rule that tombstones are never deleted
from the Inker's Journal - and accept the build-up of garbage. Or there
could be an addition to the compaction process that checks back through
all the inker files to check that the Key of a tombstone is not present
in the past, before it is removed in the compaction.