Firts part of adding support for scanning for Keys and Hashes. as part
of this discovered TTL support did the opposite (only fetched things in
the past!).
There was some unpredictable performance in tests, that was related to
the amount of time it took the sft gen_server to accept a cast whihc
passed the levelzero_cache.
The response time looked to be broadly proportional to the size of the
cache - so it appeared to be an issue with passing the large object to
the process queue.
To avoid this, the penciller now instructs the SFT gen_server to
callback to the server for each tree in the cache in turn as it is
building the list from the cache. Each of these requests should be
reltaively short, and the processing in-between should space out the
requests so the Pencille ris not blocked from answering queries when
pompting a L0 write.
Plugged the ne wpencille rmemory into the Penciller, and took advantage
of the increased speed to simplify the callbacks involved.
The outcome is much simpler code
Not yet integrated, but there is now unit-tested module for the new way
of managing L0 memory cache in the Penciller.
This mechansim is considerably more efficient than previous efforts and
should allow for further simplification of the code.
Stop the penciller from writing an empty file, when shutting down and
the L0 Cache is empty.
Also parameter fiddle to see impact of the Penciller changes.
Removed o(100) lines of code by refactoring the Penciller to no longer
use ETS tables. The code is less confusing, and probably not an awful
lot slower.
Test added for the "retain" recovery strategy. This strategy makes sure
a full history of index changes is made so that if the Ledger is wiped
out, the Ledger cna be fully rebuilt from the Journal.
This exposed two journal compaction problems
- The BestRun selected did not have the source files correctly sorted in
order before compaction
- The compaction process incorrectly dealt with the KeyDelta object
left after a compaction - i.e. compacting twice the same key caused that
key history to be lost.
These issues have now been corrected.
The unit tests for the Penciller couldn't cope with the returned status
- and so would intermittently fail (after tightening the timeout on sft
check_ready.
The CDB file management server has distinct states, and was growing case
logic to prevent certain messages from being handled in ceratin states,
and to handle different messages differently. So this has now been
converted to a gen_fsm.
As part of resolving this, the space_clear_ondelete test has been
completed, and completing this revealed that the Penciller could not
cope with a change which emptied the ledger. So a series of changes has
been handled to allow it to smoothly progress to an empty manifest.
The no_hash option in CDB files became too hard to manage, in particular
the need to scan the whole file to find the last_key rather than cheat
and use the index. It has been removed for now.
The writing to the journal during journal compaction has now been
enhanced by a mput option on the CDB file write - so it can write each
batch as one pwrite operation.
Further work on variable reload srategies wiht some unit test coverage.
Also work on potentially supporting no_hash on PUT to journal files for
objects which will never be directly fetched.
The current mechanism of re-loading data from the Journla to the Ledger
from any potential SQN is not safe when combined with Journla
compaction.
This commit doesn't resolve thes eproblems, but starts the groundwork
for resolving by introducing Inker Key Types. These types would
differentiate between objects which are standard Key/Value pairs,
objects which are tombstones for keys, and objects whihc represent Key
Changes only.
The idea is that there will be flexible reload strategies based on
object tags
- retain (retain a key change object when compacting a standard object)
- recalc (allow key changes to be recalculated from objects and ledger
state when loading the Ledger from the journal
- recover (allow for the potential loss of data on loss within the
perisste dpart of the ledger, potentially due to recovery through
externla anti-entropy operations).
Further progress towards the tidying up of basement tombstones in the
Ledger, with support added for key-listing to help with testing (and as
a potentially required feature).
The test is incomplete, but committing at this stage as the last commit
broke some tests (within the test code).
There are some outstanding questions about the handling of tombstones in
the Journal during compaction. There exists a condition whereby values
could return if a recent journal is compacted and tombstones are removed
(as they are no longer present), but older journals have not been
compacted. Now on stop/start - if the Ledger is wiped the removal of
the keys will be forgotten but the original PUTs would still remain.
The safest thing maybe to have rule that tombstones are never deleted
from the Inker's Journal - and accept the build-up of garbage. Or there
could be an addition to the compaction process that checks back through
all the inker files to check that the Key of a tombstone is not present
in the past, before it is removed in the compaction.
There was a test that failed to close down a bookie and that caused some
issues. The issues are double-reoslved, the close down was tidied as
well as the forgotten close being added back in.
There is some generla tidy around in anticipation of TTL support.
Prepare SFT files for handling tombstones correctly (without expiry
dates).
Also some work as it can be seen from tests that some SFT files ar enot
be cleared out correctly. Pausing before trying t clear out the fles to
experiment and trial the possibility that there is a timing issue.
The SFT shutdown process ahs become a series of casts to-and-from
between Penciller and SFT to stop the two processes syncronously making
requests on each other
The test confirming that deleting sft files wer eheld open whilst
snapshots were registered was actually broken. This test has now been
fixed, as well as the logic in registring snapshots which had used
ledger_sqn mistakenly rather than manifest_sqn.
File check now covered by measure in the sft_new path, whihc will backup
any existing file before moving.
This gets triggered by incomplete changes on shutdown.
Recent fixes have been made to problems associated with rapidly changing
objexts especially on re-opening of the bookie. Test of rotating
objects from both an index query and a fetch perspective added to better
detect such issues in the future.
The Penciller had two problems in previous commits:
- If it had a push_mem soon after a L0 file had been created, the
push_mem would stall waiting for the L0 file to complete - and this
count take 100-200ms
- The penciller's clerk favoured L0 work, but was lazy about asking for
other work in-between, so often the L1 layer was bursting over capacity
and the clerk was doing nothing but merging more L0 files in (with those
merges getting more and more expensive as they had to cover more and
more files)
There are some partial resolutions to this. There is now an aggressive
timeout when checking whther the L0 file is ready on a push_mem, and if
the timeout is breached the error is caught and a 'returned' message
goes back to the Bookie. the Bookie doesn't now empty its cache, it
carrie son filling it, but on some probability it will keep trying to
push_mem on future pushes. This increases Jitter around the expensive
operation and split out the L0 delay into defined chunks.
The penciller's clerk is now more aggressive in asking for work. There
is also some simplification of the relationship between clerk timeouts
and penciller back-pressure.
Also resolved is an issue of inconcistency between the loader and the on
startup (replaying the transaction log) and the standard push_mem
process. The loader was not correctly de-duplicating by adding first
(in order) to a tree before outputting the list from the tree.
Some thought will be given later as to whether non-L0 work can be safely
prioritised if the merge process still keeps getting behind.
The penciller had the concept of a manifest_lock - but it wasn't clear
what the purpose of it was.
The updating of the manifest has now been updated to reduce the code and
make the process cleaner and more obvious. Now the committed manifest
only covers non-L0 levels. A clerk can work concurrently on a manifest
change whilst the Penciller is accepting a new L0 file.
On startup the manifets is opened as well as any L0 file. There is a
possible race condition with killing process where there may be a L0
file which is merged but undeleted - and this is believed to be inert.
There is some outstanding work still. Currently the whole store is
paused if a push_mem is received by the Penciller, and the writing of a
L0 sft file has not been completed. The creation of a L0 file appears
to take about 300ms, so if the ledger_cache fills in this period a pause
will occurr (perhaps due to objects with lots of index entries). It
would be preferable to pause more elegantly in this situation. Perhaps
there should be a harsh timeout on the call to check the SFT complete,
and catching it should cause a refused response. The next PUT will then
wait, but a any queued GETs can progress.
To try and improve performance index entries had been removed from the
Ledger Cache, and a shadow list of the LedgerCache (in SQN order) was
kept to avoid gb_trees:to_list on push_mem.
This did not go well. The issue was that ets does not deal with
duplicate keys in the list when inserting (it will only insert one, but
it is not clear which one).
This has been reverted back out.
The ETS parameters have been changed to [set, private]. It is not used
as an iterator, and is no longer passed out of the process (the
memtable_copy is sent instead). This also avoids the tab2list function
being called.
The 2i work now has tests for removals as well as regex etc.
Some initial refactoring work has also been tried - to try and take some
tasks of the critical path of push_mem. The primary change has been to
avoid putting index keys into the gb_tree, and building the KeyChanges
list in parallel to the gb_tree (now known as ObjectTree) within the
Ledger Cache.
Some initial experiments done as to changing the ETS table in the
Penciller now that it will now be used for iterating - but that has been
reverted for now.
Added basic support for 2i query. This involved some refactoring of the
test code to share functions between suites.
There is sill a need for a Part 2 as no tests currently cover removal of
index entries.
Some additional tests following previous refactoring for abstraction,
primarily to make manifest print safer an dprove co-existence of Riak
and non-Riak objects.
The object tag "o" which was taken from eleveldb has been an extended to
allow for specific functions to be triggered for different object types,
in particular when extracting metadata for stroing in the Ledger.
There is now a riak tag (o_rkv@v1), and in theory other tags can be
added and used, as long as their is an appropriate set of functions in
the leveled_codec.
When the journal CDB file is called to roll it now starts a new clerk to
perform the hashtable calculation (which may take many seconds). This
stops the store from getting blocked if there is an attempt to GET from
the journal that has just been rolled.
The journal file process now has anumber fo distinct states (reading,
writing, pending_roll, closing). A future refactor may look to make
leveled_cdb a gen_fsm rather than a gen_server.