%% -------- PENCILLER ---------
%%
%% The penciller is responsible for writing and re-writing the ledger - a
%% persisted, ordered view of non-recent Keys and Metadata which have been
%% added to the store.
%% - The penciller maintains a manifest of all the files within the current
%% Ledger.
%% - The Penciller provides re-write (compaction) work to be managed by
%% the Penciller's Clerk.
%% - The Penciller can be cloned, and maintains a register of clones which
%% have requested snapshots of the Ledger.
%% - The Penciller accepts new dumps (in the form of a leveled_tree
%% accompanied by an array of hash-listing binaries) from the Bookie, and
%% responds either 'ok' to the Bookie if the information is accepted and the
%% Bookie can refresh its memory, or 'returned' if the Bookie must continue
%% without refreshing as the Penciller is not currently able to accept the
%% update (potentially due to a backlog of compaction work).
%% - The Penciller's persistence of the ledger may not be reliable, in that
%% it may lose data, but only in sequence from a particular sequence number.
%% On startup the Penciller will inform the Bookie of the highest sequence
%% number it has, and the Bookie should load any missing data from that point
%% out of the journal.
%%
%% -------- LEDGER ---------
%%
%% The Ledger is divided into many levels
%% - L0: New keys are received from the Bookie and kept in the levelzero
%% cache until that cache is the size of a SST file, at which point it is
%% persisted as a SST file at this level. L0 SST files can be larger than the
%% normal maximum size - so we don't have to consider the problems of either
%% having more than one L0 file (and handling what happens on a crash between
%% writing the files, when the second may have overlapping sequence numbers),
%% or having a remainder with overlapping sequence numbers in memory after
%% the file is written. Once the persistence is completed, the L0 cache can
%% be erased. There can be only one SST file at Level 0, so the work to merge
%% that file down to the lower level must be the highest priority, as
%% otherwise writes to the ledger will stall when there is next a need to
%% persist.
%% - L1 TO L7: May contain multiple processes managing non-overlapping SST
%% files. Compaction work should be scheduled if the number of files exceeds
%% the target size of the level, where the target size is 8 ^ n.
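%%
%% A minimal sketch of the target-size rule (the function name here is
%% hypothetical, for illustration only):
%%
%%    level_target_size(LevelN) -> trunc(math:pow(8, LevelN)).
%%    %% e.g. level_target_size(1) -> 8, level_target_size(2) -> 64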
%%
%% The most recent revision of a Key can be found by checking each level
%% until the key is found. To check a level, the correct file must be sought
%% from the manifest for that level, and then a call is made to that file. If
%% the Key is not present then every level should be checked.
%%
%% If a compaction change takes the size of a level beyond the target size,
%% then compaction work for that level + 1 should be added to the compaction
%% work queue.
%% Compaction work is fetched by the Penciller's Clerk because:
%% - it has timed out due to a period of inactivity
%% - it has been triggered by a cast to indicate the arrival of high
%% priority compaction work
%% The Penciller's Clerk (which performs the compaction work) will always
%% call the Penciller to find out the highest priority work currently
%% required, whenever it has either completed work or a timeout has occurred
%% since it was informed there was no work to do.
%%
%% When the clerk picks work it will take the current manifest, and the
%% Penciller assumes the manifest sequence number is to be incremented.
%% When the clerk has completed the work it can request that the manifest
%% change be committed by the Penciller. The commit is made through changing
%% the filename of the new manifest - so the Penciller is not held up by the
%% process of writing a file, just altering file system metadata.
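%%
%% A minimal sketch of such a rename-based commit (the function name is
%% hypothetical; file:rename/2 is the standard library call, and the filename
%% pattern is described under COMPACTION & MANIFEST UPDATES below):
%%
%%    commit_manifest(ManifestPath, ManifestSQN) ->
%%        Base = filename:join(ManifestPath,
%%                                "nonzero_" ++ integer_to_list(ManifestSQN)),
%%        ok = file:rename(Base ++ ".pnd", Base ++ ".crr").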
%%
%% ---------- PUSH ----------
%%
%% The Penciller must support the PUSH of a dump of keys from the Bookie. The
%% call to PUSH should be immediately acknowledged, and then work should be
%% completed to merge the cache update into the L0 cache.
%%
%% The Penciller MUST NOT accept a new PUSH if the Clerk has commenced the
%% conversion of the current L0 cache into a SST file but has not completed
%% this change. The Penciller in this case returns the push, and the Bookie
%% should continue to grow the cache before trying again.
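%%
%% A minimal sketch of how a caller might handle the two outcomes of
%% pcl_pushmem/2 (the cache-handling functions here are hypothetical):
%%
%%    case leveled_penciller:pcl_pushmem(Penciller, LedgerCache) of
%%        ok -> empty_ledger_cache();      % accepted - reset the cache
%%        returned -> keep_ledger_cache()  % retry later with a bigger cache
%%    end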
%%
%% ---------- FETCH ----------
%%
%% On request to fetch a key the Penciller should look first in the in-memory
%% L0 tree, then look in the SST files Level by Level (including level 0),
%% consulting the Manifest to determine which file should be checked at each
%% level.
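%%
%% A minimal sketch of this cascade (the function names are hypothetical,
%% not this module's actual internals):
%%
%%    fetch(Key, L0Cache, Manifest) ->
%%        case check_levelzero(Key, L0Cache) of
%%            {true, KV} -> KV;
%%            false -> fetch_from_level(Key, Manifest, 0)  % then L1, L2 ...
%%        end.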
%%
%% ---------- SNAPSHOT ----------
%%
%% Iterators may request a snapshot of the database. A snapshot is a cloned
%% Penciller seeded not from disk, but by the in-memory L0 gb_tree and the
%% in-memory manifest, allowing for direct reference to the SST file
%% processes.
%%
%% Clones formed to support snapshots are registered by the Penciller, so
%% that SST files valid at the point of the snapshot are retained until
%% either the iterator is completed or has timed out.
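%%
%% A minimal sketch of such a register, as a simple list (the shape of the
%% entries is hypothetical - the real register is held with the manifest):
%%
%%    Register1 = [{SnapshotPid, RegisteredSQN, TimeoutAt} | Register0],
%%    %% ... and on release or timeout the entry is removed, allowing any
%%    %% deferred file deletions to proceed:
%%    Register2 = lists:keydelete(SnapshotPid, 1, Register1)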
%%
%% ---------- ON STARTUP ----------
%%
%% On Startup the Bookie will ask the Penciller to initiate the Ledger first.
%% To initiate the Ledger the Penciller must consult the manifest, and then
%% start a SST management process for each file in the manifest.
%%
%% The penciller should then try to read any Level 0 file which has a
%% manifest sequence number one higher than the last one recorded in the
%% manifest.
%%
%% The Bookie will ask the Inker for any Keys seen beyond that sequence
%% number before the startup of the overall store can be completed.
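%%
%% A minimal sketch of that reconciliation from the Bookie's side
%% (pcl_getstartupsequencenumber/1 is exported below; the journal-reload
%% function is hypothetical):
%%
%%    MaxSQN = leveled_penciller:pcl_getstartupsequencenumber(Penciller),
%%    ok = reload_from_journal(Inker, MaxSQN + 1)  % replay Keys above MaxSQN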
%%
%% ---------- ON SHUTDOWN ----------
%%
%% On a controlled shutdown the Penciller should attempt to write any
%% in-memory ETS table to a L0 SST file, assuming one is not already pending.
%% If one is already pending then the Penciller will not persist this part of
%% the Ledger.
%%
%% ---------- FOLDER STRUCTURE ----------
%%
%% The following folders are used by the Penciller
%% $ROOT/ledger/ledger_manifest/ - used for keeping manifest files
%% $ROOT/ledger/ledger_files/ - containing individual SST files
%%
%% In larger stores there could be a large number of files in the
%% ledger_files folder - perhaps of the order of 1,000. It is assumed that
%% modern file systems should handle this efficiently.
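%%
%% A minimal sketch of deriving these paths (the function name is
%% hypothetical; the folder names match the ?MANIFEST_FP and ?FILES_FP
%% macros defined below):
%%
%%    ledger_paths(Root) ->
%%        {filename:join([Root, "ledger", "ledger_manifest"]),
%%            filename:join([Root, "ledger", "ledger_files"])}.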
%%
%% ---------- COMPACTION & MANIFEST UPDATES ----------
%%
%% The Penciller can have one and only one Clerk for performing compaction
%% work. When the Clerk has requested and taken work, it should perform the
%% compaction work, starting the new SST process to manage the new Ledger
%% state, and then write a new manifest file that represents that state,
%% using the next Manifest sequence number as the filename:
%% - nonzero_<ManifestSQN#>.pnd
%%
%% The Penciller on accepting the change should rename the manifest file to -
%% - nonzero_<ManifestSQN#>.crr
%%
%% On startup, the Penciller should look for the nonzero_*.crr file with the
%% highest such manifest sequence number. This will be started as the
%% manifest, together with any _0_0.sst file found at that Manifest SQN.
%% Level zero files are not kept in the persisted manifest, and adding a L0
%% file does not advance the Manifest SQN.
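%%
%% A minimal sketch of that startup discovery (the function name is
%% hypothetical; filelib:wildcard/2 is the standard library call):
%%
%%    highest_manifest_sqn(ManifestPath) ->
%%        Files = filelib:wildcard("nonzero_*.crr", ManifestPath),
%%        lists:max([list_to_integer(filename:rootname(Rest))
%%                        || "nonzero_" ++ Rest <- Files]).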
%%
%% The pace at which the store can accept updates will be dependent on the
%% speed at which the Penciller's Clerk can merge files at lower levels, plus
%% the time it takes to merge from Level 0. For if a clerk has commenced
%% compaction work at a lower level, and then immediately a L0 SST file is
%% written, the Penciller will need to wait for this compaction work to
%% complete and the L0 file to be compacted before the ETS table can be
%% allowed to again reach capacity.
%%
%% The writing of L0 files does not require the involvement of the clerk.
%% The L0 files are prompted directly by the penciller when the in-memory
%% tree has reached capacity. This places the penciller in a
%% levelzero_pending state, and in this state it must return new pushes. Once
%% the SST file has been completed it will confirm completion to the
%% penciller, which can then revert the levelzero_pending state, add the file
%% to the manifest and clear the current level zero in-memory view.
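%%
%% A minimal sketch of the push-rejection rule while levelzero_pending (the
%% clauses are illustrative only, not the actual handle_call clauses):
%%
%%    handle_push(_Push, State = #state{levelzero_pending = true}) ->
%%        {returned, State};
%%    handle_push(Push, State) ->
%%        {ok, merge_into_l0cache(Push, State)}.  % merge_into_l0cache is
%%                                                % hypothetical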
%%
-module(leveled_penciller).

-behaviour(gen_server).

-include("include/leveled.hrl").

-export([
        init/1,
        handle_call/3,
        handle_cast/2,
        handle_info/2,
        terminate/2,
        code_change/3,
        format_status/2]).

-export([
        pcl_snapstart/1,
        pcl_start/1,
        pcl_pushmem/2,
        pcl_fetchlevelzero/3,
        pcl_fetch/4,
        pcl_fetchkeys/5,
        pcl_fetchkeys/6,
        pcl_fetchkeysbysegment/8,
        pcl_fetchnextkey/5,
        pcl_checksequencenumber/3,
        pcl_workforclerk/1,
        pcl_manifestchange/2,
        pcl_confirml0complete/5,
        pcl_confirmdelete/3,
        pcl_close/1,
        pcl_doom/1,
        pcl_releasesnapshot/2,
        pcl_registersnapshot/5,
        pcl_getstartupsequencenumber/1,
        pcl_checkbloomtest/2,
        pcl_checkforwork/1,
        pcl_persistedsqn/1,
        pcl_loglevel/2,
        pcl_addlogs/2,
        pcl_removelogs/2]).

-export([
        sst_rootpath/1,
        sst_filename/3]).

-ifdef(TEST).
-export([
        clean_testdir/1]).
-endif.
-define(MAX_WORK_WAIT, 300).
-define(MANIFEST_FP, "ledger_manifest").
-define(FILES_FP, "ledger_files").
-define(CURRENT_FILEX, "crr").
-define(PENDING_FILEX, "pnd").
-define(SST_FILEX, ".sst").
-define(ARCHIVE_FILEX, ".bak").
-define(SUPER_MAX_TABLE_SIZE, 40000).
-define(PROMPT_WAIT_ONL0, 5).
-define(WORKQUEUE_BACKLOG_TOLERANCE, 4).
-define(COIN_SIDECOUNT, 4).
-define(SLOW_FETCH, 500000). % Log a very slow fetch - longer than 500ms
-define(ITERATOR_SCANWIDTH, 4).
-define(TIMING_SAMPLECOUNTDOWN, 10000).
-define(TIMING_SAMPLESIZE, 100).
-define(SHUTDOWN_LOOPS, 10).
-define(SHUTDOWN_PAUSE, 10000).
    % How long to wait for snapshots to be released on shutdown
    % before forcing closure of snapshots
    % 10s may not be long enough for all snapshots, but avoids crashes of
    % short-lived queries racing with the shutdown
-record(state, {manifest ::
                    leveled_pmanifest:manifest() | undefined | redacted,
                query_manifest ::
                    {list(),
                        leveled_codec:ledger_key(),
                        leveled_codec:ledger_key()} | undefined,
                    % Slimmed down version of the manifest containing part
                    % related to specific query, and the StartKey/EndKey
                    % used to extract this part
                persisted_sqn = 0 :: integer(), % The highest SQN persisted
                ledger_sqn = 0 :: integer(), % The highest SQN added to L0
                levelzero_pending = false :: boolean(),
                levelzero_constructor :: pid() | undefined,
                levelzero_cache = [] :: levelzero_cache() | redacted,
                levelzero_size = 0 :: integer(),
                levelzero_maxcachesize :: integer() | undefined,
                levelzero_cointoss = false :: boolean(),
                levelzero_index ::
                    leveled_pmem:index_array() | undefined | redacted,
                levelzero_astree :: list() | undefined | redacted,

                root_path = "test" :: string(),
                clerk :: pid() | undefined,

                is_snapshot = false :: boolean(),
                snapshot_fully_loaded = false :: boolean(),
                snapshot_time :: pos_integer() | undefined,
                source_penciller :: pid() | undefined,
                bookie_monref :: reference() | undefined,

                work_ongoing = false :: boolean(), % i.e. compaction work
                work_backlog = false :: boolean(), % i.e. compaction work

                pending_removals = [] :: list(string()),
                maybe_release = false :: boolean(),

                snaptimeout_short :: pos_integer()|undefined,
                snaptimeout_long :: pos_integer()|undefined,

                monitor = {no_monitor, 0} :: leveled_monitor:monitor(),

                sst_options = #sst_options{} :: sst_options(),

                shutdown_loops = ?SHUTDOWN_LOOPS :: non_neg_integer()
                }).
|
Develop 3.1 d30update (#386)
* Mas i370 patch d (#383)
* Refactor penciller memory
In high-volume tests on large key-count clusters, so significant variation in the P0031 time has been seen:
TimeBucket PatchA
a.0ms_to_1ms 18554
b.1ms_to_2ms 51778
c.2ms_to_3ms 696
d.3ms_to_5ms 220
e.5ms_to_8ms 59
f.8ms_to_13ms 40
g.13ms_to_21ms 364
h.21ms_to_34ms 277
i.34ms_to_55ms 34
j.55ms_to_89ms 17
k.89ms_to_144ms 21
l.144ms_to_233ms 31
m.233ms_to_377ms 45
n.377ms_to_610ms 52
o.610ms_to_987ms 59
p.987ms_to_1597ms 55
q.1597ms_to_2684ms 54
r.2684ms_to_4281ms 29
s.4281ms_to_6965ms 7
t.6295ms_to_11246ms 1
It is unclear why this varies so much. The time to add to the cache appears to be minimal (but perhaps there is an issue with timing points in the code), whereas the time to add to the index is much more significant and variable. There is also variable time when the memory is rolled (although the actual activity here appears to be minimal.
The refactoring here is two-fold:
- tidy and simplify by keeping LoopState managed within handle_call, and add more helpful dialyzer specs;
- change the update to the index to be a simple extension of a list, rather than any conversion.
This alternative version of the pmem index in unit test is orders of magnitude faster to add - and is the same order of magnitude to check. Anticipation is that it may be more efficient in terms of memory changes.
* Compress SST index
Reduces the size of the leveled_sst index with two changes:
1 - Where there is a common prefix of tuple elements (e.g. Bucket) across the whole leveled_sst file - only the non-common part is indexed, and a function is used to compare.
2 - There is less "indexing" of the index i.e. only 1 in 16 keys are passed into the gb_trees part instead of 1 in 4
* Immediate hibernate
Reasons for delay in hibernate were not clear.
Straight after creation the process will not be in receipt of messages (must wait for the manifest to be updated), so better to hibernate now. This also means the log PC023 provides more accurate information.
* Refactor BIC
This patch avoids the following:
- repeated replacement of the same element in the BIC (via get_kvrange), by checking presence via GET before sing SET
- Stops re-reading of all elements to discover high modified date
Also there appears to have been a bug where a missing HMD for the file is required to add to the cache. However, now the cache may be erased without erasing the HMD. This means that the cache can never be rebuilt
* Use correct size in test results
erts_debug:flat_size/1 returns size in words (i.e. 8 bytes on 64-bit CPU) not bytes
* Don't change summary record
As it is persisted as part of the file write, any change to the summary record cannot be rolled back
* Clerk to prompt L0 write
Simplifies the logic if the clerk request work for the penciller prompts L0 writes as well as Manifest changes.
The advantage now is that if the penciller memory is full, and PUT load stops, the clerk should still be able to prompt persistence. the penciller can therefore make use of dead time this way
* Add push on journal compact
If there has been a backlog, followed by a quiet period - there may be a large ledger cache left unpushed. Journal compaction events are about once per hour, so the performance overhead of a false push should be minimal, with the advantage of clearing any backlog before load starts again.
This is only relevant to riak users with very off/full batch type workloads.
* Extend tests
To more consistently trigger all overload scenarios
* Fix range keys smaller than prefix
Can't make end key an empty binary in this case, as it may be bigger than any keys within the range, but will appear to be smaller.
Unit tests and ct tests added to expose the potential issue
* Tidy-up
- Remove penciller logs which are no longer called
- Get pclerk to only wait MIN_TIMEOUT after doing work, in case there is a backlog
- Remove update_levelzero_cache function as it is unique to handle_call of push_mem, and simple enough to be inline
- Alight testutil slow offer with standard slow offer used
* Tidy-up
Remove pre-otp20 references.
Reinstate the check that the starting pid is still active, this was added to tidy up shutdown.
Resolve failure to run on otp20 due to `-if` sttaement
* Tidy up
Using null rather then {null, Key} is potentially clearer as it is not a concern what they Key is in this case, and removes a comparison step from the leveled_codec:endkey_passed/2 function.
There were issues with coverage in eunit tests as the leveled_pclerk shut down. This prompted a general tidy of leveled_pclerk (remove passing of LoopState into internal functions, and add dialyzer specs.
* Remove R16 relic
* Further testing another issue
The StartKey must always be less than or equal to the prefix when the first N characters are stripped, but this is not true of the EndKey (for the query) which does not have to be between the FirstKey and the LastKey.
If the EndKey query does not match it must be greater than the Prefix (as otherwise it would not have been greater than the FirstKey - so set to null.
* Fix unit test
Unit test had a typo - and result interpretation had a misunderstanding.
* Code and spec tidy
Also look to the cover the situation when the FirstKey is the same as the Prefix with tests.
This is, in theory, not an issue as it is the EndKey for each sublist which is indexed in leveled_tree. However, guard against it mapping to null here, just in case there are dangers lurking (note that tests will still pass without `M > N` guard in place.
* Hibernate on BIC complete
There are three situations when the BIC becomes complete:
- In a file created as part of a merge the BIS is learned in the merge
- After startup, files below L1 learn the block cache through reads that happen to read the block, eventually the while cache will be read, unless...
- Either before/after the cache is complete, it can get whiped by a timeout after a get_sqn request (e.g. as prompted by a journal compaction) ... it will then be re-filled of the back of get/get-range requests.
In all these situations we want to hibernate after the BIC is fill - to reflect the fact that the LoopState should now be relatively stable, so it is a good point to GC and rationalise location of data.
Previously on the the first base was covered. Now all three are covered through the bic_complete message.
* Test all index keys have same term
This works functionally, but is not optimised (the term is replicated in the index)
* Summaries with same index term
If the summary index all have the same index term - only the object keys need to be indexes
* Simplify case statements
We either match the pattern of <<Prefix:N, Suffix>> or the answer should be null
* OK for M == N
If M = N for the first key, it will have a suffix of <<>>. This will match (as expected) a query Start Key of the sam size, and be smaller than any query Start Key that has the same prefix.
If the query Start Key does not match the prefix - it will be null - as it must be smaller than the Prefix (as other wise the query Start Key would be bigger than the Last Key).
The constraint of M > N was introduced before the *_prefix_filter functions were checking the prefix, to avoid issues. Now the prefix is being checked, then M == N is ok.
* Simplify
Correct the test to use a binary field in the range.
To avoid further issue, only apply filter when everything is a binary() type.
* Add test for head_only mode
When leveled is used as a tictacaae key store (in parallel mode), the keys will be head_only entries. Double check they are handled as expected like object keys
* Revert previous change - must support typed buckets
Add assertion to confirm worthwhile optimisation
* Add support for configurable cache multiple (#375)
* Mas i370 patch e (#385)
Improvement to monitoring for efficiency and improved readability of logs and stats.
As part of this, where possible, tried to avoid updating loop state on READ messages in leveled processes (as was the case when tracking stats within each process).
No performance benefits found with change, but improved stats has helped discover other potential gains.
2022-12-18 20:18:03 +00:00
|
|
|
|
2017-11-21 19:58:36 +00:00
|
|
|
|
2017-05-22 18:09:12 +01:00
|
|
|
-type penciller_options() :: #penciller_options{}.
|
|
|
|
-type bookies_memory() :: {tuple()|empty_cache,
|
Develop 3.1 d30update (#386)
* Mas i370 patch d (#383)
* Refactor penciller memory
In high-volume tests on large key-count clusters, so significant variation in the P0031 time has been seen:
TimeBucket PatchA
a.0ms_to_1ms 18554
b.1ms_to_2ms 51778
c.2ms_to_3ms 696
d.3ms_to_5ms 220
e.5ms_to_8ms 59
f.8ms_to_13ms 40
g.13ms_to_21ms 364
h.21ms_to_34ms 277
i.34ms_to_55ms 34
j.55ms_to_89ms 17
k.89ms_to_144ms 21
l.144ms_to_233ms 31
m.233ms_to_377ms 45
n.377ms_to_610ms 52
o.610ms_to_987ms 59
p.987ms_to_1597ms 55
q.1597ms_to_2684ms 54
r.2684ms_to_4281ms 29
s.4281ms_to_6965ms 7
t.6295ms_to_11246ms 1
It is unclear why this varies so much. The time to add to the cache appears to be minimal (but perhaps there is an issue with timing points in the code), whereas the time to add to the index is much more significant and variable. There is also variable time when the memory is rolled (although the actual activity here appears to be minimal.
The refactoring here is two-fold:
- tidy and simplify by keeping LoopState managed within handle_call, and add more helpful dialyzer specs;
- change the update to the index to be a simple extension of a list, rather than any conversion.
This alternative version of the pmem index in unit test is orders of magnitude faster to add - and is the same order of magnitude to check. Anticipation is that it may be more efficient in terms of memory changes.
* Compress SST index
Reduces the size of the leveled_sst index with two changes:
1 - Where there is a common prefix of tuple elements (e.g. Bucket) across the whole leveled_sst file - only the non-common part is indexed, and a function is used to compare.
2 - There is less "indexing" of the index i.e. only 1 in 16 keys are passed into the gb_trees part instead of 1 in 4
* Immediate hibernate
Reasons for delay in hibernate were not clear.
Straight after creation the process will not be in receipt of messages (must wait for the manifest to be updated), so better to hibernate now. This also means the log PC023 provides more accurate information.
* Refactor BIC
This patch avoids the following:
- repeated replacement of the same element in the BIC (via get_kvrange), by checking presence via GET before sing SET
- Stops re-reading of all elements to discover high modified date
Also there appears to have been a bug where a missing HMD for the file is required to add to the cache. However, now the cache may be erased without erasing the HMD. This means that the cache can never be rebuilt
* Use correct size in test results
erts_debug:flat_size/1 returns size in words (i.e. 8 bytes on 64-bit CPU) not bytes
* Don't change summary record
As it is persisted as part of the file write, any change to the summary record cannot be rolled back
* Clerk to prompt L0 write
Simplifies the logic: when the clerk requests work from the penciller, this now prompts L0 writes as well as Manifest changes.
The advantage now is that if the penciller memory is full, and PUT load stops, the clerk should still be able to prompt persistence. The penciller can therefore make use of dead time this way.
* Add push on journal compact
If there has been a backlog, followed by a quiet period - there may be a large ledger cache left unpushed. Journal compaction events are about once per hour, so the performance overhead of a false push should be minimal, with the advantage of clearing any backlog before load starts again.
This is only relevant to Riak users with very off/full batch-type workloads.
* Extend tests
To more consistently trigger all overload scenarios.
* Fix range keys smaller than prefix
Can't make the end key an empty binary in this case, as it may be bigger than any keys within the range, but will appear to be smaller.
Unit tests and ct tests added to expose the potential issue.
* Tidy-up
- Remove penciller logs which are no longer called
- Get pclerk to only wait MIN_TIMEOUT after doing work, in case there is a backlog
- Remove the update_levelzero_cache function as it is unique to the handle_call of push_mem, and simple enough to be inlined
- Align the testutil slow offer with the standard slow offer used
* Tidy-up
Remove pre-otp20 references.
Reinstate the check that the starting pid is still active; this was added to tidy up shutdown.
Resolve failure to run on otp20 due to an `-if` statement.
* Tidy up
Using null rather than {null, Key} is potentially clearer, as it is not a concern what the Key is in this case, and it removes a comparison step from the leveled_codec:endkey_passed/2 function.
There were issues with coverage in eunit tests as the leveled_pclerk shut down. This prompted a general tidy of leveled_pclerk (remove passing of LoopState into internal functions, and add dialyzer specs).
* Remove R16 relic
* Further testing another issue
The StartKey must always be less than or equal to the prefix when the first N characters are stripped, but this is not true of the EndKey (for the query) which does not have to be between the FirstKey and the LastKey.
If the EndKey of the query does not match the prefix, it must be greater than the Prefix (as otherwise it would not have been greater than the FirstKey) - so set it to null.
* Fix unit test
Unit test had a typo - and result interpretation had a misunderstanding.
* Code and spec tidy
Also look to cover the situation when the FirstKey is the same as the Prefix with tests.
This is, in theory, not an issue as it is the EndKey for each sublist which is indexed in leveled_tree. However, guard against it mapping to null here, just in case there are dangers lurking (note that tests will still pass without the `M > N` guard in place).
* Hibernate on BIC complete
There are three situations in which the BIC becomes complete:
- In a file created as part of a merge, the BIC is learned in the merge
- After startup, files below L1 learn the block cache through reads that happen to read the block; eventually the whole cache will be read, unless...
- Either before/after the cache is complete, it can get wiped by a timeout after a get_sqn request (e.g. as prompted by a journal compaction) ... it will then be re-filled off the back of get/get-range requests.
In all these situations we want to hibernate after the BIC is full - to reflect the fact that the LoopState should now be relatively stable, so it is a good point to GC and rationalise the location of data.
Previously only the first case was covered. Now all three are covered through the bic_complete message.
* Test all index keys have same term
This works functionally, but is not optimised (the term is replicated in the index).
* Summaries with same index term
If the summary index entries all have the same index term - only the object keys need to be indexed.
* Simplify case statements
We either match the pattern of <<Prefix:N, Suffix>> or the answer should be null.
* OK for M == N
If M = N for the first key, it will have a suffix of <<>>. This will match (as expected) a query Start Key of the same size, and be smaller than any query Start Key that has the same prefix.
If the query Start Key does not match the prefix - it will be null - as it must be smaller than the Prefix (as otherwise the query Start Key would be bigger than the Last Key).
The constraint of M > N was introduced before the *_prefix_filter functions were checking the prefix, to avoid issues. Now the prefix is being checked, M == N is ok.
* Simplify
Correct the test to use a binary field in the range.
To avoid further issues, only apply the filter when everything is a binary() type.
* Add test for head_only mode
When leveled is used as a tictacaae key store (in parallel mode), the keys will be head_only entries. Double check they are handled as expected like object keys.
* Revert previous change - must support typed buckets
Add assertion to confirm worthwhile optimisation.
* Add support for configurable cache multiple (#375)
* Mas i370 patch e (#385)
Improvement to monitoring for efficiency and improved readability of logs and stats.
As part of this, where possible, tried to avoid updating loop state on READ messages in leveled processes (as was the case when tracking stats within each process).
No performance benefits found with the change, but improved stats have helped discover other potential gains.
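As an illustrative sketch only (the function name is an assumption, not the module's actual code), the prefix check described above can be pictured as:

    %% StartKey either shares the common Prefix (possibly with an empty
    %% suffix - the M == N case) or maps to null
    strip_prefix(Prefix, StartKey) when is_binary(StartKey) ->
        N = byte_size(Prefix),
        case StartKey of
            <<Prefix:N/binary, Suffix/binary>> ->
                Suffix;  % M == N gives Suffix == <<>>, which sorts lowest
            _ ->
                null     % StartKey does not share the common prefix
        end.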
                            array:array()|empty_array,
                            integer()|infinity,
                            integer()}.
-type pcl_state() :: #state{}.
-type levelzero_cacheentry() :: {pos_integer(), leveled_tree:leveled_tree()}.
-type levelzero_cache() :: list(levelzero_cacheentry()).
-type iterator_entry()
        :: {pos_integer(),
            list(leveled_codec:ledger_kv()|leveled_sst:expandable_pointer())}.
-type iterator() :: list(iterator_entry()).
-type bad_ledgerkey() :: list().
-type sqn_check() :: current|replaced|missing.
-type sst_fetchfun() ::
        fun((pid(),
                leveled_codec:ledger_key(),
                leveled_codec:segment_hash(),
                non_neg_integer()) ->
                    leveled_codec:ledger_kv()|not_present).
-type levelzero_returnfun() :: fun((levelzero_cacheentry()) -> ok).
-type pclacc_fun() ::
        fun((leveled_codec:ledger_key(),
                leveled_codec:ledger_value(),
                any()) -> any()).
-type sst_options() :: #sst_options{}.

-export_type([levelzero_cacheentry/0, levelzero_returnfun/0, sqn_check/0]).

%%%============================================================================
%%% API
%%%============================================================================

-spec pcl_start(penciller_options()) -> {ok, pid()}.
%% @doc
%% Start a penciller using a penciller options record. The start_snapshot
%% option should be used if this is to be a clone of an existing penciller,
%% otherwise the penciller will look in the root path for a manifest and
%% associated sst files to start-up from a previous persisted state.
%%
%% When starting a clone a query can also be passed. This prevents the whole
%% Level Zero memory space from being copied to the snapshot; instead the
%% query is run against the level zero space and just the query results are
%% copied into the clone.
pcl_start(PCLopts) ->
    gen_server:start_link(?MODULE, [leveled_log:get_opts(), PCLopts], []).
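
%% As an illustrative sketch only (the root_path value is an assumption for
%% the example; the fields are as referenced in init/1 below) - starting a
%% primary, non-snapshot penciller:
%%
%%   {ok, PCL} =
%%       leveled_penciller:pcl_start(
%%           #penciller_options{root_path = "/tmp/ledger",
%%                              start_snapshot = false}).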

-spec pcl_snapstart(penciller_options()) -> {ok, pid()}.
%% @doc
%% Don't link to the bookie - this is a snapshot
pcl_snapstart(PCLopts) ->
    gen_server:start(?MODULE, [leveled_log:get_opts(), PCLopts], []).

-spec pcl_pushmem(pid(), bookies_memory()) -> ok|returned.
%% @doc
%% Load the contents of the Bookie's memory of recent additions to the Ledger
%% to the Ledger proper.
%%
%% The load is made up of a cache in the form of a leveled_skiplist tuple (or
%% the atom empty_cache if no cache is present), an index of entries in the
%% skiplist in the form of a leveled_pmem index (or empty_index), the minimum
%% sequence number in the cache and the maximum sequence number.
%%
%% If the penciller does not have capacity for the pushed cache it will
%% respond with the atom 'returned'. This is a signal to hold the memory
%% at the Bookie, and try again soon. This normally only occurs when there
%% is a backlog of merges - so the bookie should back off for longer each
%% time.
pcl_pushmem(Pid, LedgerCache) ->
    %% Bookie to dump memory onto penciller
    gen_server:call(Pid, {push_mem, LedgerCache}, infinity).
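
%% As an illustrative sketch only (the function name and pause values are
%% assumptions, not part of the leveled API) - a caller honouring the
%% ok|returned protocol described above:
%%
%%   push_with_backoff(PCL, BookiesMem, PauseMS) ->
%%       case leveled_penciller:pcl_pushmem(PCL, BookiesMem) of
%%           ok ->
%%               ok;  % the Bookie may now refresh its ledger cache
%%           returned ->
%%               % hold the cache at the Bookie, back off for longer each
%%               % attempt, then push the (possibly extended) cache again
%%               timer:sleep(PauseMS),
%%               push_with_backoff(PCL, BookiesMem, PauseMS * 2)
%%       end.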

-spec pcl_fetchlevelzero(pid(),
                            non_neg_integer(),
                            fun((levelzero_cacheentry()) -> ok))
                        -> ok.
%% @doc
%% Allows a single slot of the penciller's levelzero cache to be fetched. The
%% levelzero cache can be up to 40K keys - sending this to the process that is
%% persisting this in a SST file in a single cast will lock the process for
%% 30-40ms. This allows that process to fetch the cache slot by slot, so that
%% this is split into a series of smaller events.
%%
%% The return value will be a leveled_skiplist that forms that part of the
%% cache
pcl_fetchlevelzero(Pid, Slot, ReturnFun) ->
    % Timeout to cause crash of L0 file when it can't get the close signal
    % as it is deadlocked making this call.
    %
    % If the timeout gets hit outside of the close scenario the Penciller will
    % be stuck in L0 pending
    gen_server:cast(Pid, {fetch_levelzero, Slot, ReturnFun}).
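
%% As an illustrative sketch only (the names are assumptions) - a ReturnFun
%% that forwards each fetched levelzero cache slot to the process persisting
%% the L0 file:
%%
%%   fetch_slot(PCLPid, SlotNumber, WriterPid) ->
%%       ReturnFun =
%%           fun(CacheEntry) ->
%%               WriterPid ! {levelzero_slot, CacheEntry},
%%               ok
%%           end,
%%       leveled_penciller:pcl_fetchlevelzero(PCLPid, SlotNumber, ReturnFun).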

-spec pcl_fetch(pid(),
                leveled_codec:ledger_key(),
                leveled_codec:segment_hash(),
                boolean()) -> leveled_codec:ledger_kv()|not_present.
%% @doc
%% Fetch a key, return the first (highest SQN) occurrence of that Key along
%% with the value.
%%
%% Hash should be result of leveled_codec:segment_hash(Key)
pcl_fetch(Pid, Key, Hash, UseL0Index) ->
    gen_server:call(Pid, {fetch, Key, Hash, UseL0Index}, infinity).
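
%% As an illustrative sketch only (the wrapper name is an assumption) -
%% fetching a key with its segment hash, as the doc above requires:
%%
%%   fetch(PCL, LedgerKey) ->
%%       Hash = leveled_codec:segment_hash(LedgerKey),
%%       leveled_penciller:pcl_fetch(PCL, LedgerKey, Hash, true).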

-spec pcl_fetchkeys(pid(),
                    leveled_codec:ledger_key(),
                    leveled_codec:ledger_key(),
                    pclacc_fun(), any(), as_pcl|by_runner) -> any().
%% @doc
%% Run a range query between StartKey and EndKey (inclusive). This will cover
%% all keys in the range - so must only be run against snapshots of the
%% penciller to avoid blocking behaviour.
%%
%% Comparison with the upper-end of the range (EndKey) is done using
%% leveled_codec:endkey_passed/2 - so use nulls within the tuple to manage
%% the top of the range. Comparison with the start of the range is based on
%% Erlang term order.
pcl_fetchkeys(Pid, StartKey, EndKey, AccFun, InitAcc) ->
    pcl_fetchkeys(Pid, StartKey, EndKey, AccFun, InitAcc, as_pcl).

pcl_fetchkeys(Pid, StartKey, EndKey, AccFun, InitAcc, By) ->
    gen_server:call(Pid,
                    {fetch_keys,
                        StartKey, EndKey,
                        AccFun, InitAcc,
                        false, false, -1,
                        By},
                    infinity).
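
%% As an illustrative sketch only (the names are assumptions) - counting the
%% keys in a range with a simple pclacc_fun accumulator, run against a
%% snapshot of the penciller:
%%
%%   count_keys(SnapPCL, StartKey, EndKey) ->
%%       AccFun = fun(_Key, _Value, Count) -> Count + 1 end,
%%       leveled_penciller:pcl_fetchkeys(SnapPCL, StartKey, EndKey, AccFun, 0).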

-spec pcl_fetchkeysbysegment(pid(),
                                leveled_codec:ledger_key(),
                                leveled_codec:ledger_key(),
                                pclacc_fun(), any(),
                                leveled_codec:segment_list(),
                                false | leveled_codec:lastmod_range(),
                                boolean()) -> any().
%% @doc
%% Run a range query between StartKey and EndKey (inclusive). This will cover
%% all keys in the range - so must only be run against snapshots of the
%% penciller to avoid blocking behaviour.
%%
%% This version allows an additional input of a SegmentList. This is a list
%% of 16-bit integers representing the segment IDs (i.e. the segment hash
%% band ((2 ^ 16) - 1)) that are interesting to the fetch.
%%
%% Note that the SegmentList must be false unless the object Tag supports
%% additional indexing by segment. This cannot be used on ?IDX_TAG and other
%% tags that use the no_lookup hash.
pcl_fetchkeysbysegment(Pid, StartKey, EndKey, AccFun, InitAcc,
                        SegmentList, LastModRange, LimitByCount) ->
    {MaxKeys, InitAcc0} =
        case LimitByCount of
            true ->
                % The passed in accumulator should have the Max Key Count
                % as the first element of a tuple with the actual accumulator
                InitAcc;
            false ->
                {-1, InitAcc}
        end,
    gen_server:call(Pid,
                    {fetch_keys,
                        StartKey, EndKey, AccFun, InitAcc0,
                        SegmentList, LastModRange, MaxKeys,
                        by_runner},
                    infinity).
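
%% As an illustrative sketch only (the helper name is an assumption, and the
%% shape of the segment_hash result is assumed from its use elsewhere in this
%% module) - deriving a 16-bit segment ID for use in a SegmentList:
%%
%%   segment_for_key(LedgerKey) ->
%%       case leveled_codec:segment_hash(LedgerKey) of
%%           no_lookup ->
%%               false;  % e.g. ?IDX_TAG - cannot be filtered by segment
%%           {SegHash, _ExtraHash} ->
%%               SegHash band ((1 bsl 16) - 1)
%%       end.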

-spec pcl_fetchnextkey(pid(),
                        leveled_codec:ledger_key(),
                        leveled_codec:ledger_key(),
                        pclacc_fun(), any()) -> any().
%% @doc
%% Run a range query between StartKey and EndKey (inclusive). This has the
%% same constraints as pcl_fetchkeys/5, but will only return the first key
%% found in erlang term order.
pcl_fetchnextkey(Pid, StartKey, EndKey, AccFun, InitAcc) ->
    gen_server:call(Pid,
                    {fetch_keys,
                        StartKey, EndKey,
                        AccFun, InitAcc,
                        false, false, 1,
                        as_pcl},
                    infinity).

-spec pcl_checksequencenumber(pid(),
                                leveled_codec:ledger_key()|bad_ledgerkey(),
                                integer()) -> sqn_check().
%% @doc
%% Check if the sequence number of the passed key is not replaced by a change
%% after the passed sequence number. Will return:
%% - current
%% - replaced
%% - missing
pcl_checksequencenumber(Pid, Key, SQN) ->
    Hash = leveled_codec:segment_hash(Key),
    if
        Hash /= no_lookup ->
            gen_server:call(Pid, {check_sqn, Key, Hash, SQN}, infinity)
    end.
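
%% As an illustrative sketch only (the caller name is an assumption) - using
%% the sqn_check() result to decide whether a journal entry is still current:
%%
%%   is_current(PCL, LedgerKey, JournalSQN) ->
%%       case leveled_penciller:pcl_checksequencenumber(
%%               PCL, LedgerKey, JournalSQN) of
%%           current -> true;    % the ledger still points at this SQN
%%           replaced -> false;  % a newer change has superseded it
%%           missing -> false    % no current entry for the key
%%       end.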

-spec pcl_workforclerk(pid()) -> ok.
%% @doc
%% A request from the clerk to check for work. If work is present the
%% Penciller will cast back to the clerk; no response is sent to this
%% request.
pcl_workforclerk(Pid) ->
    gen_server:cast(Pid, work_for_clerk).

-spec pcl_manifestchange(pid(), leveled_pmanifest:manifest()) -> ok.
%% @doc
%% Provide a manifest record (i.e. the output of the leveled_pmanifest module)
%% that is required to become the new manifest.
pcl_manifestchange(Pid, Manifest) ->
    gen_server:cast(Pid, {manifest_change, Manifest}).

-spec pcl_confirml0complete(pid(),
                            string(),
                            leveled_codec:ledger_key(),
                            leveled_codec:ledger_key(),
                            binary()) -> ok.
%% @doc
%% Allows a SST writer that has written a L0 file to confirm that the file
%% is now complete, so the filename and key ranges can be added to the
%% manifest and the file can be used in place of the in-memory levelzero
%% cache.
pcl_confirml0complete(Pid, FN, StartKey, EndKey, Bloom) ->
    gen_server:cast(Pid, {levelzero_complete, FN, StartKey, EndKey, Bloom}).

-spec pcl_confirmdelete(pid(), string(), pid()) -> ok.
%% @doc
%% Poll from a delete_pending file requesting a message if the file is now
%% ready for deletion (i.e. all snapshots which depend on the file have
%% finished)
pcl_confirmdelete(Pid, FileName, FilePid) ->
    gen_server:cast(Pid, {confirm_delete, FileName, FilePid}).

-spec pcl_getstartupsequencenumber(pid()) -> integer().
%% @doc
%% At startup the penciller will get the largest sequence number that is
%% within the persisted files. This function allows for this sequence number
%% to be fetched - so that it can be used to determine parts of the Ledger
%% which may have been lost in the last shutdown (so that the ledger can
%% be reloaded from that point in the Journal)
pcl_getstartupsequencenumber(Pid) ->
    gen_server:call(Pid, get_startup_sqn, infinity).

-spec pcl_registersnapshot(pid(),
                            pid(),
                            no_lookup|{tuple(), tuple()}|undefined,
                            bookies_memory(),
                            boolean())
                                -> {ok, pcl_state()}.
%% @doc
%% Register a snapshot of the penciller, returning a state record from the
%% penciller for the snapshot to use as its LoopData
pcl_registersnapshot(Pid, Snapshot, Query, BookiesMem, LR) ->
    gen_server:call(Pid,
                    {register_snapshot, Snapshot, Query, BookiesMem, LR},
                    infinity).

-spec pcl_releasesnapshot(pid(), pid()) -> ok.
%% @doc
%% Inform the primary penciller that a snapshot is finished, so that the
%% penciller can allow deletes to proceed if appropriate.
pcl_releasesnapshot(Pid, Snapshot) ->
    gen_server:cast(Pid, {release_snapshot, Snapshot}).

-spec pcl_persistedsqn(pid()) -> integer().
%% @doc
%% Return the persisted SQN, the highest SQN which has been persisted into the
%% Ledger
pcl_persistedsqn(Pid) ->
    gen_server:call(Pid, persisted_sqn, infinity).

-spec pcl_close(pid()) -> ok.
%% @doc
%% Close the penciller neatly, trying to persist to disk anything in the
%% memory
pcl_close(Pid) ->
    gen_server:call(Pid, close, infinity).

-spec pcl_snapclose(pid()) -> ok.
%% @doc
%% Specifically to be used when closing snapshots on shutdown; will handle a
%% scenario where a snapshot has already exited
pcl_snapclose(Pid) ->
    try
        pcl_close(Pid)
    catch
        exit:{noproc, _CallDetails} ->
            ok
    end.

-spec pcl_doom(pid()) -> {ok, list()}.
%% @doc
%% Close the penciller neatly, trying to persist to disk anything in the
%% memory. Return a list of filepaths from where files exist for this
%% penciller (should the calling process wish to erase the store).
pcl_doom(Pid) ->
    gen_server:call(Pid, doom, infinity).

-spec pcl_checkbloomtest(pid(), tuple()) -> boolean().
%% @doc
%% Function specifically added to help testing. In particular to make sure
%% that blooms are still available after pencillers have been re-loaded from
%% disk.
pcl_checkbloomtest(Pid, Key) ->
    Hash = leveled_codec:segment_hash(Key),
    if
        Hash /= no_lookup ->
            gen_server:call(Pid, {checkbloom_fortest, Key, Hash}, 2000)
    end.

-spec pcl_checkforwork(pid()) -> boolean().
%% @doc
%% Used in test only to confirm compaction work is complete before closing
pcl_checkforwork(Pid) ->
    gen_server:call(Pid, check_for_work, 2000).

-spec pcl_loglevel(pid(), leveled_log:log_level()) -> ok.
%% @doc
%% Change the log level of the Penciller
pcl_loglevel(Pid, LogLevel) ->
    gen_server:cast(Pid, {log_level, LogLevel}).

-spec pcl_addlogs(pid(), list(string())) -> ok.
%% @doc
%% Add to the list of forced logs, a list of more forced logs
pcl_addlogs(Pid, ForcedLogs) ->
    gen_server:cast(Pid, {add_logs, ForcedLogs}).

-spec pcl_removelogs(pid(), list(string())) -> ok.
%% @doc
%% Remove from the list of forced logs, a list of forced logs
pcl_removelogs(Pid, ForcedLogs) ->
    gen_server:cast(Pid, {remove_logs, ForcedLogs}).

%%%============================================================================
%%% gen_server callbacks
%%%============================================================================

init([LogOpts, PCLopts]) ->
    leveled_log:save(LogOpts),
    leveled_rand:seed(),
    case {PCLopts#penciller_options.root_path,
            PCLopts#penciller_options.start_snapshot,
            PCLopts#penciller_options.snapshot_query,
            PCLopts#penciller_options.bookies_mem} of
        {undefined, _Snapshot=true, Query, BookiesMem} ->
            SrcPenciller = PCLopts#penciller_options.source_penciller,
            LongRunning = PCLopts#penciller_options.snapshot_longrunning,
            %% monitor the bookie, and close the snapshot when the bookie
            %% exits
            BookieMonitor =
                erlang:monitor(process, PCLopts#penciller_options.bookies_pid),
            {ok, State} =
                pcl_registersnapshot(
                    SrcPenciller, self(), Query, BookiesMem, LongRunning),
            leveled_log:log(p0001, [self()]),
            {ok,
                State#state{
                    is_snapshot = true,
                    clerk = undefined,
                    bookie_monref = BookieMonitor,
                    source_penciller = SrcPenciller}};
        {_RootPath, _Snapshot=false, _Q, _BM} ->
            start_from_file(PCLopts)
    end.


handle_call({push_mem, {LedgerTable, PushedIdx, MinSQN, MaxSQN}},
            _From,
            State=#state{is_snapshot=Snap}) when Snap == false ->
    % The push_mem process is as follows:
    %
    % 1. If either the penciller is still waiting on the last L0 file to be
    % written, or there is a work backlog - the cache is returned with the
    % expectation that PUTs should be slowed. Also if the cache has reached
    % the maximum number of lines (by default after 31 pushes from the bookie)
    %
    % 2. If (1) does not apply, the bookie's cache will be added to the
    % penciller's cache.
    SW = os:timestamp(),

    L0Pending = State#state.levelzero_pending,
    WorkBacklog = State#state.work_backlog,
    CacheAlreadyFull = leveled_pmem:cache_full(State#state.levelzero_cache),
    L0Size = State#state.levelzero_size,

    % The clerk is prompted into action as there may be a L0 write required
    ok = leveled_pclerk:clerk_prompt(State#state.clerk),

    case L0Pending or WorkBacklog or CacheAlreadyFull of
        true ->
            % Cannot update the cache, or roll the memory, so reply `returned`.
            % The Bookie must now retain the ledger cache and try to push the
            % updated cache at a later time
            leveled_log:log(
                p0018,
                [L0Size, L0Pending, WorkBacklog, CacheAlreadyFull]),
            {reply, returned, State};
* Fix unit test
Unit test had a typo - and result interpretation had a misunderstanding.
* Code and spec tidy
Also look to the cover the situation when the FirstKey is the same as the Prefix with tests.
This is, in theory, not an issue as it is the EndKey for each sublist which is indexed in leveled_tree. However, guard against it mapping to null here, just in case there are dangers lurking (note that tests will still pass without `M > N` guard in place.
* Hibernate on BIC complete
There are three situations when the BIC becomes complete:
- In a file created as part of a merge the BIS is learned in the merge
- After startup, files below L1 learn the block cache through reads that happen to read the block, eventually the while cache will be read, unless...
- Either before/after the cache is complete, it can get whiped by a timeout after a get_sqn request (e.g. as prompted by a journal compaction) ... it will then be re-filled of the back of get/get-range requests.
In all these situations we want to hibernate after the BIC is fill - to reflect the fact that the LoopState should now be relatively stable, so it is a good point to GC and rationalise location of data.
Previously on the the first base was covered. Now all three are covered through the bic_complete message.
* Test all index keys have same term
This works functionally, but is not optimised (the term is replicated in the index)
* Summaries with same index term
If the summary index all have the same index term - only the object keys need to be indexes
* Simplify case statements
We either match the pattern of <<Prefix:N, Suffix>> or the answer should be null
* OK for M == N
If M = N for the first key, it will have a suffix of <<>>. This will match (as expected) a query Start Key of the sam size, and be smaller than any query Start Key that has the same prefix.
If the query Start Key does not match the prefix - it will be null - as it must be smaller than the Prefix (as other wise the query Start Key would be bigger than the Last Key).
The constraint of M > N was introduced before the *_prefix_filter functions were checking the prefix, to avoid issues. Now the prefix is being checked, then M == N is ok.
* Simplify
Correct the test to use a binary field in the range.
To avoid further issue, only apply filter when everything is a binary() type.
* Add test for head_only mode
When leveled is used as a tictacaae key store (in parallel mode), the keys will be head_only entries. Double check they are handled as expected like object keys
* Revert previous change - must support typed buckets
Add assertion to confirm worthwhile optimisation
* Add support for configurable cache multiple (#375)
* Mas i370 patch e (#385)
Improvement to monitoring for efficiency and improved readability of logs and stats.
As part of this, where possible, tried to avoid updating loop state on READ messages in leveled processes (as was the case when tracking stats within each process).
No performance benefits found with change, but improved stats has helped discover other potential gains.
2022-12-18 20:18:03 +00:00
|
|
|
        false ->
            % Return ok as cache has been updated on State and the Bookie
            % should clear its ledger cache which is now with the Penciller
            PushedTree =
                case is_tuple(LedgerTable) of
                    true ->
                        LedgerTable;
                    false ->
                        leveled_tree:from_orderedset(LedgerTable, ?CACHE_TYPE)
                end,
            case leveled_pmem:add_to_cache(
                    L0Size,
                    {PushedTree, MinSQN, MaxSQN},
                    State#state.ledger_sqn,
                    State#state.levelzero_cache,
                    true) of
                empty_push ->
                    {reply, ok, State};
                {UpdMaxSQN, NewL0Size, UpdL0Cache} ->
                    UpdL0Index =
                        leveled_pmem:add_to_index(
                            PushedIdx,
                            State#state.levelzero_index,
                            length(State#state.levelzero_cache) + 1),
                    leveled_log:log_randomtimer(
                        p0031,
                        [NewL0Size, true, true, MinSQN, MaxSQN], SW, 0.1),
                    {reply,
                        ok,
                        State#state{
                            levelzero_cache = UpdL0Cache,
                            levelzero_size = NewL0Size,
                            levelzero_index = UpdL0Index,
                            ledger_sqn = UpdMaxSQN}}
            end
    end;
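%% A `returned` reply is the penciller's backpressure signal: the caller
%% keeps hold of its ledger cache and offers it again later. A minimal
%% sketch of such a retry loop, assuming a hypothetical RetryDelay and the
%% exported pcl_pushmem/2 API:
%%
%% push_until_accepted(Penciller, LedgerCache, RetryDelay) ->
%%     case leveled_penciller:pcl_pushmem(Penciller, LedgerCache) of
%%         ok ->
%%             ok;
%%         returned ->
%%             timer:sleep(RetryDelay),
%%             push_until_accepted(Penciller, LedgerCache, RetryDelay)
%%     end.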
handle_call({fetch, Key, Hash, UseL0Index}, _From, State) ->
    L0Idx =
        case UseL0Index of
            true ->
                State#state.levelzero_index;
            false ->
                none
        end,
    R =
        timed_fetch_mem(
            Key, Hash, State#state.manifest,
            State#state.levelzero_cache, L0Idx,
            State#state.monitor),
    {reply, R, State};
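%% Where no levelzero_index is available, the caller passes false for
%% UseL0Index and the whole L0 cache must be checked for the key. A hedged
%% sketch of a direct call, assuming leveled_codec:segment_hash/1 as the
%% hash source:
%%
%% Hash = leveled_codec:segment_hash(Key),
%% R = gen_server:call(Penciller, {fetch, Key, Hash, true}, infinity).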
handle_call({check_sqn, Key, Hash, SQN}, _From, State) ->
    {reply,
        compare_to_sqn(
            fetch_sqn(
                Key,
                Hash,
                State#state.manifest,
                State#state.levelzero_cache,
                State#state.levelzero_index),
            SQN),
        State};
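%% SQN checks serve background processes - journal folds and, mainly,
%% journal compaction - testing whether the sequence number held in the
%% journal for a key is still the current one in the ledger; hence the
%% fetch here is not timed. A hedged sketch of such a check, assuming the
%% exported pcl_checksequencenumber API:
%%
%% KeyIsCurrent =
%%     leveled_penciller:pcl_checksequencenumber(Penciller, Key, SQN).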
handle_call({fetch_keys,
                StartKey, EndKey,
                AccFun, InitAcc,
                SegmentList, LastModRange, MaxKeys, By},
            _From,
            State=#state{snapshot_fully_loaded=Ready})
        when Ready == true ->
    LastModRange0 =
        case LastModRange of
            false ->
                ?OPEN_LASTMOD_RANGE;
            R ->
                R
        end,
    SW = os:timestamp(),
    L0AsList =
        case State#state.levelzero_astree of
            undefined ->
                leveled_pmem:merge_trees(StartKey,
                                            EndKey,
                                            State#state.levelzero_cache,
                                            leveled_tree:empty(?CACHE_TYPE));
            List ->
                List
        end,
    FilteredL0 =
        case SegmentList of
            false ->
                L0AsList;
            _ ->
                TunedList = leveled_sst:tune_seglist(SegmentList),
                FilterFun =
                    fun(LKV) ->
                        CheckSeg =
                            leveled_sst:extract_hash(
                                leveled_codec:strip_to_segmentonly(LKV)),
                        leveled_sst:member_check(CheckSeg, TunedList)
                    end,
                lists:filter(FilterFun, L0AsList)
        end,

    leveled_log:log_randomtimer(
        p0037, [State#state.levelzero_size], SW, 0.01),

    %% Rename any reference to loop state that may be used by the function
    %% to be returned - https://github.com/martinsumner/leveled/issues/326
    SSTiter =
        case State#state.query_manifest of
            undefined ->
                leveled_pmanifest:query_manifest(
                    State#state.manifest, StartKey, EndKey);
            {QueryManifest, StartKeyQM, EndKeyQM}
                    when StartKey >= StartKeyQM, EndKey =< EndKeyQM ->
                QueryManifest
        end,
    SnapshotTime = State#state.snapshot_time,

    Folder =
        fun() ->
            keyfolder({FilteredL0, SSTiter},
                        {StartKey, EndKey},
                        {AccFun, InitAcc, SnapshotTime},
                        {SegmentList, LastModRange0, MaxKeys})
        end,
    case By of
        as_pcl ->
            {reply, Folder(), State};
        by_runner ->
            {reply, Folder, State}
    end;
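%% With by_runner the fold is deferred: the reply is the Folder fun itself
%% and the calling runner evaluates it, keeping long folds out of this
%% gen_server loop; as_pcl evaluates the fold before replying. A hedged
%% sketch of the by_runner path (false/false/-1 assumed to mean no segment
%% filter, no last-modified range and no key limit):
%%
%% Folder = gen_server:call(PclSnapshot,
%%                          {fetch_keys, StartKey, EndKey, AccFun, InitAcc,
%%                              false, false, -1, by_runner},
%%                          infinity),
%% Acc = Folder().  % the fold now runs in the runner, not the penciller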
handle_call(get_startup_sqn, _From, State) ->
    {reply, State#state.persisted_sqn, State};
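%% The Query argument to register_snapshot drives three clone setups in the
%% clause below: no_lookup (the bookie's cache is merged into a copy of the
%% L0 cache and the manifest copied), {StartKey, EndKey} (only the requested
%% range is merged to a tree, with a range-limited query manifest), and
%% undefined (a full clone). As the atom suggests, only the full clone is
%% assumed here to also support fetch.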
|
2018-12-14 11:23:04 +00:00
|
|
|
handle_call({register_snapshot, Snapshot, Query, BookiesMem, LongRunning},
|
|
|
|
_From, State) ->
|
2017-03-06 18:42:32 +00:00
|
|
|
% Register and load a snapshot
|
|
|
|
%
|
|
|
|
% For setup of the snapshot to be efficient should pass a query
|
|
|
|
% of (StartKey, EndKey) - this will avoid a fully copy of the penciller's
|
|
|
|
% memory being required to be trasnferred to the clone. However, this
|
|
|
|
% will not be a valid clone for fetch
|
2018-12-14 11:23:04 +00:00
|
|
|
|
|
|
|
TimeO =
|
|
|
|
case LongRunning of
|
2017-04-05 09:16:01 +01:00
|
|
|
true ->
|
2018-12-14 11:23:04 +00:00
|
|
|
State#state.snaptimeout_long;
|
2017-04-05 09:16:01 +01:00
|
|
|
false ->
|
2018-12-14 11:23:04 +00:00
|
|
|
State#state.snaptimeout_short
|
2017-04-05 09:16:01 +01:00
|
|
|
end,
|
2018-12-14 11:23:04 +00:00
|
|
|
Manifest0 =
|
|
|
|
leveled_pmanifest:add_snapshot(State#state.manifest, Snapshot, TimeO),
|
2017-04-05 09:16:01 +01:00
|
|
|
|
2017-03-06 18:42:32 +00:00
|
|
|
{BookieIncrTree, BookieIdx, MinSQN, MaxSQN} = BookiesMem,
|
|
|
|
LM1Cache =
|
2017-03-02 17:49:43 +00:00
|
|
|
case BookieIncrTree of
|
|
|
|
empty_cache ->
|
2017-03-06 18:42:32 +00:00
|
|
|
leveled_tree:empty(?CACHE_TYPE);
|
2017-03-02 17:49:43 +00:00
|
|
|
_ ->
|
2017-03-06 18:42:32 +00:00
|
|
|
BookieIncrTree
|
2017-03-02 17:49:43 +00:00
|
|
|
end,
|
2017-03-06 18:42:32 +00:00
|
|
|
|
2022-10-11 13:45:55 +01:00
|
|
|
{CloneState, ManifestClone, QueryManifest} =
|
2017-03-06 18:42:32 +00:00
|
|
|
case Query of
|
|
|
|
no_lookup ->
|
|
|
|
{UpdMaxSQN, UpdSize, L0Cache} =
|
2023-01-18 11:44:02 +00:00
|
|
|
leveled_pmem:add_to_cache(
|
|
|
|
State#state.levelzero_size,
|
|
|
|
{LM1Cache, MinSQN, MaxSQN},
|
|
|
|
State#state.ledger_sqn,
|
|
|
|
State#state.levelzero_cache,
|
|
|
|
false),
|
2022-10-11 13:45:55 +01:00
|
|
|
{#state{levelzero_cache = L0Cache,
|
2017-03-06 18:42:32 +00:00
|
|
|
ledger_sqn = UpdMaxSQN,
|
|
|
|
levelzero_size = UpdSize,
|
2022-10-11 13:45:55 +01:00
|
|
|
persisted_sqn = State#state.persisted_sqn},
|
|
|
|
leveled_pmanifest:copy_manifest(State#state.manifest),
|
|
|
|
undefined};
|
2017-03-06 18:42:32 +00:00
|
|
|
{StartKey, EndKey} ->
|
|
|
|
SW = os:timestamp(),
|
|
|
|
L0AsTree =
|
|
|
|
leveled_pmem:merge_trees(StartKey,
|
|
|
|
EndKey,
|
|
|
|
State#state.levelzero_cache,
|
|
|
|
LM1Cache),
|
Develop 3.1 d30update (#386)
* Mas i370 patch d (#383)
* Refactor penciller memory
In high-volume tests on large key-count clusters, so significant variation in the P0031 time has been seen:
TimeBucket PatchA
a.0ms_to_1ms 18554
b.1ms_to_2ms 51778
c.2ms_to_3ms 696
d.3ms_to_5ms 220
e.5ms_to_8ms 59
f.8ms_to_13ms 40
g.13ms_to_21ms 364
h.21ms_to_34ms 277
i.34ms_to_55ms 34
j.55ms_to_89ms 17
k.89ms_to_144ms 21
l.144ms_to_233ms 31
m.233ms_to_377ms 45
n.377ms_to_610ms 52
o.610ms_to_987ms 59
p.987ms_to_1597ms 55
q.1597ms_to_2684ms 54
r.2684ms_to_4281ms 29
s.4281ms_to_6965ms 7
t.6295ms_to_11246ms 1
It is unclear why this varies so much. The time to add to the cache appears to be minimal (but perhaps there is an issue with timing points in the code), whereas the time to add to the index is much more significant and variable. There is also variable time when the memory is rolled (although the actual activity here appears to be minimal.
The refactoring here is two-fold:
- tidy and simplify by keeping LoopState managed within handle_call, and add more helpful dialyzer specs;
- change the update to the index to be a simple extension of a list, rather than any conversion.
This alternative version of the pmem index in unit test is orders of magnitude faster to add - and is the same order of magnitude to check. Anticipation is that it may be more efficient in terms of memory changes.
* Compress SST index
Reduces the size of the leveled_sst index with two changes:
1 - Where there is a common prefix of tuple elements (e.g. Bucket) across the whole leveled_sst file - only the non-common part is indexed, and a function is used to compare.
2 - There is less "indexing" of the index i.e. only 1 in 16 keys are passed into the gb_trees part instead of 1 in 4
* Immediate hibernate
Reasons for delay in hibernate were not clear.
Straight after creation the process will not be in receipt of messages (must wait for the manifest to be updated), so better to hibernate now. This also means the log PC023 provides more accurate information.
* Refactor BIC
This patch avoids the following:
- repeated replacement of the same element in the BIC (via get_kvrange), by checking presence via GET before sing SET
- Stops re-reading of all elements to discover high modified date
Also there appears to have been a bug where a missing HMD for the file is required to add to the cache. However, now the cache may be erased without erasing the HMD. This means that the cache can never be rebuilt
* Use correct size in test results
erts_debug:flat_size/1 returns size in words (i.e. 8 bytes on 64-bit CPU) not bytes
* Don't change summary record
As it is persisted as part of the file write, any change to the summary record cannot be rolled back
* Clerk to prompt L0 write
Simplifies the logic if the clerk request work for the penciller prompts L0 writes as well as Manifest changes.
The advantage now is that if the penciller memory is full, and PUT load stops, the clerk should still be able to prompt persistence. the penciller can therefore make use of dead time this way
* Add push on journal compact
If there has been a backlog, followed by a quiet period - there may be a large ledger cache left unpushed. Journal compaction events are about once per hour, so the performance overhead of a false push should be minimal, with the advantage of clearing any backlog before load starts again.
This is only relevant to riak users with very off/full batch type workloads.
* Extend tests
To more consistently trigger all overload scenarios
* Fix range keys smaller than prefix
Can't make end key an empty binary in this case, as it may be bigger than any keys within the range, but will appear to be smaller.
Unit tests and ct tests added to expose the potential issue
* Tidy-up
- Remove penciller logs which are no longer called
- Get pclerk to only wait MIN_TIMEOUT after doing work, in case there is a backlog
- Remove update_levelzero_cache function as it is unique to handle_call of push_mem, and simple enough to be inline
                leveled_log:log_randomtimer(
                    p0037, [State#state.levelzero_size], SW, 0.01),
                {#state{levelzero_astree = L0AsTree,
                        ledger_sqn = MaxSQN,
                        persisted_sqn = State#state.persisted_sqn},
                    undefined,
                    {leveled_pmanifest:query_manifest(
                        State#state.manifest, StartKey, EndKey),
                        StartKey,
                        EndKey}};
            undefined ->
                {UpdMaxSQN, UpdSize, L0Cache} =
                    leveled_pmem:add_to_cache(
                        State#state.levelzero_size,
                        {LM1Cache, MinSQN, MaxSQN},
                        State#state.ledger_sqn,
                        State#state.levelzero_cache,
                        false),
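                % The Bookie's cache is pushed with an accompanying index
                % (BookieIdx) - where no index was provided (empty_index),
                % start from a fresh index before extending the level-zero
                % index.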
                LM1Idx =
                    case BookieIdx of
                        empty_index ->
                            leveled_pmem:new_index();
                        _ ->
                            BookieIdx
                    end,
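                % Extend the level-zero index with the index for the newly
                % added cache entry - a simple extension of a list, rather
                % than any conversion.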
                L0Index =
                    leveled_pmem:add_to_index(
                        LM1Idx, State#state.levelzero_index, length(L0Cache)),
                {#state{levelzero_cache = L0Cache,
                        levelzero_index = L0Index,
                        levelzero_size = UpdSize,
                        ledger_sqn = UpdMaxSQN,
                        persisted_sqn = State#state.persisted_sqn},
                    leveled_pmanifest:copy_manifest(State#state.manifest),
                    undefined}
        end,
    {reply,
        {ok,
            CloneState#state{snapshot_fully_loaded = true,
                             snapshot_time = leveled_util:integer_now(),
                             manifest = ManifestClone,
                             query_manifest = QueryManifest}},
        State#state{manifest = Manifest0}};
handle_call(close, _From, State=#state{is_snapshot=Snap}) when Snap == true ->
    ok = pcl_releasesnapshot(State#state.source_penciller, self()),
    {stop, normal, ok, State};
handle_call(close, From, State) ->
    % Level 0 files lie outside of the manifest, and so if there is no L0
    % file present it is safe to write the current contents of memory. If
    % there is a L0 file present - then the memory can be dropped (it is
    % recoverable from the ledger, and there should not be a lot to recover
    % as presumably the ETS file has been recently flushed, hence the presence
    % of a L0 file).
    %
    % The penciller should close each file in the manifest, and call a close
    % on the clerk.
    ok = leveled_pclerk:clerk_close(State#state.clerk),
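    % If there is level-zero memory resident, and no L0 file write is
    % already pending, that memory must be rolled to a L0 file as part of
    % the close.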
    leveled_log:log(p0008, [close]),
    L0Left = State#state.levelzero_size > 0,
    case (not State#state.levelzero_pending and L0Left) of
        true ->
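            % Roll the level-zero memory to a new L0 SST file (taking a
            % manifest sequence number one above the current manifest), and
            % close that file cleanly before completing the close.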
            Man0 = State#state.manifest,
            {Constructor, _} =
                roll_memory(
                    leveled_pmanifest:get_manifest_sqn(Man0) + 1,
                    State#state.ledger_sqn,
                    State#state.root_path,
                    State#state.levelzero_cache,
                    length(State#state.levelzero_cache),
                    State#state.sst_options,
                    true),
            ok = leveled_sst:sst_close(Constructor);
        false ->
Develop 3.1 d30update (#386)
* Mas i370 patch d (#383)
* Refactor penciller memory
In high-volume tests on large key-count clusters, so significant variation in the P0031 time has been seen:
TimeBucket PatchA
a.0ms_to_1ms 18554
b.1ms_to_2ms 51778
c.2ms_to_3ms 696
d.3ms_to_5ms 220
e.5ms_to_8ms 59
f.8ms_to_13ms 40
g.13ms_to_21ms 364
h.21ms_to_34ms 277
i.34ms_to_55ms 34
j.55ms_to_89ms 17
k.89ms_to_144ms 21
l.144ms_to_233ms 31
m.233ms_to_377ms 45
n.377ms_to_610ms 52
o.610ms_to_987ms 59
p.987ms_to_1597ms 55
q.1597ms_to_2684ms 54
r.2684ms_to_4281ms 29
s.4281ms_to_6965ms 7
t.6295ms_to_11246ms 1
It is unclear why this varies so much. The time to add to the cache appears to be minimal (but perhaps there is an issue with timing points in the code), whereas the time to add to the index is much more significant and variable. There is also variable time when the memory is rolled (although the actual activity here appears to be minimal.
The refactoring here is two-fold:
- tidy and simplify by keeping LoopState managed within handle_call, and add more helpful dialyzer specs;
- change the update to the index to be a simple extension of a list, rather than any conversion.
This alternative version of the pmem index in unit test is orders of magnitude faster to add - and is the same order of magnitude to check. Anticipation is that it may be more efficient in terms of memory changes.
* Compress SST index
Reduces the size of the leveled_sst index with two changes:
1 - Where there is a common prefix of tuple elements (e.g. Bucket) across the whole leveled_sst file - only the non-common part is indexed, and a function is used to compare.
2 - There is less "indexing" of the index i.e. only 1 in 16 keys are passed into the gb_trees part instead of 1 in 4
* Immediate hibernate
Reasons for delay in hibernate were not clear.
Straight after creation the process will not be in receipt of messages (must wait for the manifest to be updated), so better to hibernate now. This also means the log PC023 provides more accurate information.
* Refactor BIC
This patch avoids the following:
- repeated replacement of the same element in the BIC (via get_kvrange), by checking presence via GET before sing SET
- Stops re-reading of all elements to discover high modified date
Also there appears to have been a bug where a missing HMD for the file is required to add to the cache. However, now the cache may be erased without erasing the HMD. This means that the cache can never be rebuilt
* Use correct size in test results
erts_debug:flat_size/1 returns size in words (i.e. 8 bytes on 64-bit CPU) not bytes
* Don't change summary record
As it is persisted as part of the file write, any change to the summary record cannot be rolled back
* Clerk to prompt L0 write
Simplifies the logic if the clerk request work for the penciller prompts L0 writes as well as Manifest changes.
The advantage now is that if the penciller memory is full, and PUT load stops, the clerk should still be able to prompt persistence. the penciller can therefore make use of dead time this way
* Add push on journal compact
If there has been a backlog, followed by a quiet period - there may be a large ledger cache left unpushed. Journal compaction events are about once per hour, so the performance overhead of a false push should be minimal, with the advantage of clearing any backlog before load starts again.
This is only relevant to riak users with very off/full batch type workloads.
* Extend tests
To more consistently trigger all overload scenarios
* Fix range keys smaller than prefix
Can't make end key an empty binary in this case, as it may be bigger than any keys within the range, but will appear to be smaller.
Unit tests and ct tests added to expose the potential issue
* Tidy-up
- Remove penciller logs which are no longer called
- Get pclerk to only wait MIN_TIMEOUT after doing work, in case there is a backlog
- Remove update_levelzero_cache function as it is unique to handle_call of push_mem, and simple enough to be inline
- Align testutil slow offer with the standard slow offer used
* Tidy-up
Remove pre-otp20 references.
Reinstate the check that the starting pid is still active; this was added to tidy up shutdown.
Resolve failure to run on otp20 due to an `-if` statement.
* Tidy up
Using null rather than {null, Key} is potentially clearer, as it is not a concern what the Key is in this case, and it removes a comparison step from the leveled_codec:endkey_passed/2 function.
There were issues with coverage in eunit tests as the leveled_pclerk shut down. This prompted a general tidy of leveled_pclerk (remove passing of LoopState into internal functions, and add dialyzer specs).
* Remove R16 relic
* Further testing another issue
The StartKey must always be less than or equal to the prefix when the first N characters are stripped, but this is not true of the EndKey (for the query) which does not have to be between the FirstKey and the LastKey.
If the EndKey for the query does not match the prefix, it must be greater than the Prefix (as otherwise it would not have been greater than the FirstKey) - so it is set to null.
* Fix unit test
The unit test had a typo - and the interpretation of the result rested on a misunderstanding.
* Code and spec tidy
Also look to cover, with tests, the situation where the FirstKey is the same as the Prefix.
This is, in theory, not an issue, as it is the EndKey for each sublist which is indexed in leveled_tree. However, guard against it mapping to null here, just in case there are dangers lurking (note that tests will still pass without the `M > N` guard in place).
* Hibernate on BIC complete
There are three situations when the BIC becomes complete:
- In a file created as part of a merge, the BIC is learned during the merge
- After startup, files below L1 learn the block cache through reads that happen to read the block; eventually the whole cache will be read, unless...
- Either before or after the cache is complete, it can get wiped by a timeout after a get_sqn request (e.g. as prompted by a journal compaction) ... it will then be re-filled off the back of get/get-range requests.
In all these situations we want to hibernate after the BIC is full - to reflect the fact that the LoopState should now be relatively stable, so it is a good point to GC and rationalise the location of data.
Previously only the first case was covered. Now all three are covered through the bic_complete message.
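In gen_server terms the hibernate is requested on the callback return; a sketch of the shape only (the state field name here is hypothetical):

handle_cast(bic_complete, State) ->
    {noreply, State#state{blockindex_cache_complete = true}, hibernate}.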
* Test all index keys have same term
This works functionally, but is not optimised (the term is replicated in the index)
* Summaries with same index term
If the summary index entries all have the same index term - only the object keys need to be indexed
* Simplify case statements
We either match the pattern of <<Prefix:N, Suffix>> or the answer should be null
* OK for M == N
If M == N for the first key, it will have a suffix of <<>>. This will match (as expected) a query Start Key of the same size, and be smaller than any query Start Key that has the same prefix.
If the query Start Key does not match the prefix, it will be mapped to null - as it must be smaller than the Prefix (as otherwise the query Start Key would be bigger than the Last Key).
The constraint of M > N was introduced before the *_prefix_filter functions were checking the prefix, to avoid issues. Now that the prefix is being checked, M == N is OK.
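A hedged sketch of the shape of that check (hypothetical helper, not the exact leveled_sst code):

prefix_filter(StartKey, Prefix) when is_binary(StartKey), is_binary(Prefix) ->
    N = byte_size(Prefix),
    case StartKey of
        <<Prefix:N/binary, Suffix/binary>> ->
            Suffix;    % may be <<>> when M == N
        _ ->
            null       % no shared prefix, so map to null
    end.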
* Simplify
Correct the test to use a binary field in the range.
To avoid further issue, only apply filter when everything is a binary() type.
* Add test for head_only mode
When leveled is used as a tictacaae key store (in parallel mode), the keys will be head_only entries. Double-check they are handled as expected, like object keys.
* Revert previous change - must support typed buckets
Add assertion to confirm worthwhile optimisation
* Add support for configurable cache multiple (#375)
* Mas i370 patch e (#385)
Improvement to monitoring for efficiency and improved readability of logs and stats.
As part of this, where possible, we tried to avoid updating loop state on READ messages in leveled processes (as was the case when tracking stats within each process).
No performance benefit was found with the change, but the improved stats have helped to discover other potential gains.

            leveled_log:log(p0010, [State#state.levelzero_size])
    end,
    gen_server:cast(self(), {maybe_defer_shutdown, close, From}),
    {noreply, State};
handle_call(doom, From, State) ->
    leveled_log:log(p0030, []),
    ok = leveled_pclerk:clerk_close(State#state.clerk),
    gen_server:cast(self(), {maybe_defer_shutdown, doom, From}),
    {noreply, State};
handle_call({checkbloom_fortest, Key, Hash}, _From, State) ->
    Manifest = State#state.manifest,
    FoldFun =
        fun(Level, Acc) ->
            case Acc of
                true ->
                    true;
                false ->
                    case leveled_pmanifest:key_lookup(Manifest, Level, Key) of
                        false ->
                            false;
                        FP ->
                            leveled_pmanifest:check_bloom(Manifest, FP, Hash)
                    end
            end
        end,
    {reply, lists:foldl(FoldFun, false, lists:seq(0, ?MAX_LEVELS)), State};
handle_call(check_for_work, _From, State) ->
    {_WL, WC} = leveled_pmanifest:check_for_work(State#state.manifest),
    {reply, WC > 0, State};
handle_call(persisted_sqn, _From, State) ->
    {reply, State#state.persisted_sqn, State}.

handle_cast({manifest_change, Manifest}, State) ->
    NewManSQN = leveled_pmanifest:get_manifest_sqn(Manifest),
    OldManSQN = leveled_pmanifest:get_manifest_sqn(State#state.manifest),
    leveled_log:log(p0041, [OldManSQN, NewManSQN]),
    % Only safe to update the manifest if the SQN increments
    if NewManSQN > OldManSQN ->
        ok =
            leveled_pclerk:clerk_promptdeletions(State#state.clerk, NewManSQN),
        % This is accepted as the new manifest, files may be deleted
        UpdManifest0 =
            leveled_pmanifest:merge_snapshot(State#state.manifest, Manifest),
        % Need to preserve the penciller's view of snapshots stored in
        % the manifest
        UpdManifest1 =
            leveled_pmanifest:clear_pending(
                UpdManifest0,
                lists:usort(State#state.pending_removals),
                State#state.maybe_release),
        {noreply,
            State#state{
                manifest=UpdManifest1,
                pending_removals = [],
                maybe_release = false,
                work_ongoing=false}}
    end;
handle_cast({release_snapshot, Snapshot}, State) ->
    Manifest0 =
        leveled_pmanifest:release_snapshot(State#state.manifest, Snapshot),
    leveled_log:log(p0003, [Snapshot]),
    {noreply, State#state{manifest=Manifest0}};
handle_cast({confirm_delete, PDFN, FilePid}, State=#state{is_snapshot=Snap})
        when Snap == false ->
    % This is a two-stage process. A file that is ready for deletion can be
    % checked against the manifest to prompt the deletion; however, it must
    % also be removed from the manifest's list of pending deletes. This is
    % only possible when the manifest is in the control of the penciller,
    % not the clerk.
    % When work is ongoing (i.e. the manifest is under the control of the
    % clerk), any removals from the manifest need to be stored temporarily
    % (in pending_removals) until such time as the manifest is in the control
    % of the penciller and can be updated.
    % The maybe_release boolean on state is used if any file is not ready to
    % delete while there is work ongoing. This will then trigger a check to
    % ensure any timed-out snapshots are released, in case this is the factor
    % blocking the delete confirmation.
    % When an updated manifest is submitted by the clerk, the
    % pending_removals will be cleared from pending using the maybe_release
    % boolean.
    case leveled_pmanifest:ready_to_delete(State#state.manifest, PDFN) of
        true ->
            leveled_log:log(p0005, [PDFN]),
            ok = leveled_sst:sst_deleteconfirmed(FilePid),
            case State#state.work_ongoing of
                true ->
                    {noreply,
                        State#state{
                            pending_removals =
                                [PDFN|State#state.pending_removals]}};
                false ->
                    UpdManifest =
                        leveled_pmanifest:clear_pending(
                            State#state.manifest,
                            [PDFN],
                            false),
                    {noreply,
                        State#state{manifest = UpdManifest}}
            end;
        false ->
            case State#state.work_ongoing of
                true ->
                    {noreply, State#state{maybe_release = true}};
                false ->
                    UpdManifest =
                        leveled_pmanifest:clear_pending(
                            State#state.manifest,
                            [],
                            true),
                    {noreply,
                        State#state{manifest = UpdManifest}}
            end
    end;
handle_cast({levelzero_complete, FN, StartKey, EndKey, Bloom}, State) ->
    leveled_log:log(p0029, []),
    ManEntry = #manifest_entry{start_key=StartKey,
                               end_key=EndKey,
                               owner=State#state.levelzero_constructor,
                               filename=FN,
                               bloom=Bloom},
    ManifestSQN = leveled_pmanifest:get_manifest_sqn(State#state.manifest) + 1,
    UpdMan = leveled_pmanifest:insert_manifest_entry(State#state.manifest,
                                                     ManifestSQN,
                                                     0,
                                                     ManEntry),
    % Prompt clerk to ask about work - do this for every L0 roll
    ok = leveled_pclerk:clerk_prompt(State#state.clerk),
    {noreply, State#state{levelzero_cache=[],
2022-12-18 20:18:03 +00:00
|
|
|
            levelzero_index=[],
            levelzero_pending=false,
            levelzero_constructor=undefined,
            levelzero_size=0,
            manifest=UpdMan,
            persisted_sqn=State#state.ledger_sqn}};
handle_cast(work_for_clerk, State) ->
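    % Check what work, if any, the clerk should be prompted to do.
    % Precedence: nothing can start if an L0 write or a merge is already
    % in-flight; otherwise merging an existing L0 file down to L1 takes
    % priority; only then consider rolling the in-memory cache or
    % addressing the manifest's work queue.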
    case {(State#state.levelzero_pending or State#state.work_ongoing),
            leveled_pmanifest:levelzero_present(State#state.manifest)} of
        {true, _L0Present} ->
            % Work is blocked by ongoing activity
            {noreply, State};
        {false, true} ->
            % If L0 present, and no work ongoing - dropping L0 to L1 is the
            % priority
            ok = leveled_pclerk:clerk_push(
                State#state.clerk, {0, State#state.manifest}),
            {noreply, State#state{work_ongoing=true}};
        {false, false} ->
            % No impediment to work - see what other work may be required
            % See if the in-memory cache requires rolling now
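            % A roll may be required either because the cache is full in
            % slot terms (cache_full), or because it has grown beyond the
            % configured maximum size - with a coin-toss element so that
            % rolls are staggered rather than all happening at the limit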
            CacheOverSize =
                maybe_cache_too_big(
                    State#state.levelzero_size,
                    State#state.levelzero_maxcachesize,
                    State#state.levelzero_cointoss),
            CacheAlreadyFull =
                leveled_pmem:cache_full(State#state.levelzero_cache),
            % Check for a backlog of work
            {WL, WC} = leveled_pmanifest:check_for_work(State#state.manifest),
            case {WC, (CacheAlreadyFull or CacheOverSize)} of
                {0, false} ->
                    % No work required
                    {noreply, State#state{work_backlog = false}};
                {WC, true} when WC < ?WORKQUEUE_BACKLOG_TOLERANCE ->
                    % Rolling the memory to create a new Level Zero file
                    % Must not do this if there is a work backlog beyond the
                    % tolerance, as then the backlog may never be addressed.
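                    % roll_memory starts a new L0 SST file via an async
                    % constructor process; whilst levelzero_pending is
                    % true, pushes from the Bookie will be returned rather
                    % than accepted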
                    NextSQN =
                        leveled_pmanifest:get_manifest_sqn(
                            State#state.manifest) + 1,
                    {Constructor, none} =
                        roll_memory(
                            NextSQN,
                            State#state.ledger_sqn,
                            State#state.root_path,
                            none,
                            length(State#state.levelzero_cache),
                            State#state.sst_options,
                            false),
                    {noreply,
                        State#state{
                            levelzero_pending = true,
                            levelzero_constructor = Constructor,
                            work_backlog = false}};
                {WC, L0Full} ->
                    % Address the backlog of work, either because there is no
                    % L0 work to do, or because the backlog has grown beyond
                    % tolerance
                    Backlog = WC >= ?WORKQUEUE_BACKLOG_TOLERANCE,
                    leveled_log:log(p0024, [WC, Backlog, L0Full]),
                    [TL|_Tail] = WL,
                    ok =
                        leveled_pclerk:clerk_push(
                            State#state.clerk, {TL, State#state.manifest}),
                    {noreply,
                        State#state{
                            work_backlog = Backlog, work_ongoing = true}}
            end
    end;
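%% The L0 file constructor does not receive the whole L0 cache in a single
%% message; it fetches the cache slot-by-slot through this cast, with each
%% slot handed back through the passed-in ReturnFun.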
handle_cast({fetch_levelzero, Slot, ReturnFun}, State) ->
    ReturnFun(lists:nth(Slot, State#state.levelzero_cache)),
    {noreply, State};
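%% Changes to log settings must be passed on to the penciller's clerk, and
%% must also be merged into sst_options - so that any SST file process
%% started in the future will inherit the updated log options.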
handle_cast({log_level, LogLevel}, State) ->
    update_clerk(
        State#state.clerk, fun leveled_pclerk:clerk_loglevel/2, LogLevel),
    ok = leveled_log:set_loglevel(LogLevel),
    SSTopts = State#state.sst_options,
    SSTopts0 = SSTopts#sst_options{log_options = leveled_log:get_opts()},
    {noreply, State#state{sst_options = SSTopts0}};
handle_cast({add_logs, ForcedLogs}, State) ->
    update_clerk(
        State#state.clerk, fun leveled_pclerk:clerk_addlogs/2, ForcedLogs),
    ok = leveled_log:add_forcedlogs(ForcedLogs),
    SSTopts = State#state.sst_options,
    SSTopts0 = SSTopts#sst_options{log_options = leveled_log:get_opts()},
    {noreply, State#state{sst_options = SSTopts0}};
handle_cast({remove_logs, ForcedLogs}, State) ->
    update_clerk(
        State#state.clerk, fun leveled_pclerk:clerk_removelogs/2, ForcedLogs),
    ok = leveled_log:remove_forcedlogs(ForcedLogs),
    SSTopts = State#state.sst_options,
    SSTopts0 = SSTopts#sst_options{log_options = leveled_log:get_opts()},
    {noreply, State#state{sst_options = SSTopts0}};
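%% Allow any remaining snapshot users the chance to release before closing
%% down - checking up to ?SHUTDOWN_LOOPS times, pausing for
%% ?SHUTDOWN_PAUSE div ?SHUTDOWN_LOOPS between checks.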
handle_cast({maybe_defer_shutdown, ShutdownType, From}, State) ->
    case length(leveled_pmanifest:snapshot_pids(State#state.manifest)) of
        0 ->
            gen_server:cast(self(), {complete_shutdown, ShutdownType, From}),
            {noreply, State};
        N ->
            % Whilst this process sleeps, any remaining snapshots may
            % release and have their release messages queued before the
            % complete_shutdown cast is sent
            case State#state.shutdown_loops of
                LoopCount when LoopCount > 0 ->
                    leveled_log:log(p0042, [N]),
                    timer:sleep(?SHUTDOWN_PAUSE div ?SHUTDOWN_LOOPS),
                    gen_server:cast(
                        self(), {maybe_defer_shutdown, ShutdownType, From}),
                    {noreply, State#state{shutdown_loops = LoopCount - 1}};
                0 ->
                    gen_server:cast(
                        self(), {complete_shutdown, ShutdownType, From}),
                    {noreply, State}
            end
    end;
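%% Force-close any snapshots still registered, shut down the manifest and
%% any in-flight L0 constructor, then reply to the deferred shutdown
%% request - a doom shutdown replies with the file paths that the caller
%% may then delete.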
handle_cast({complete_shutdown, ShutdownType, From}, State) ->
    lists:foreach(
        fun(Snap) -> ok = pcl_snapclose(Snap) end,
        leveled_pmanifest:snapshot_pids(State#state.manifest)),
    shutdown_manifest(State#state.manifest, State#state.levelzero_constructor),
    case ShutdownType of
        doom ->
            ManifestFP = State#state.root_path ++ "/" ++ ?MANIFEST_FP ++ "/",
            FilesFP = State#state.root_path ++ "/" ++ ?FILES_FP ++ "/",
            gen_server:reply(From, {ok, [ManifestFP, FilesFP]});
        close ->
            gen_server:reply(From, ok)
    end,
    {stop, normal, State}.

%% handle the bookie stopping and stop this snapshot
handle_info({'DOWN', BookieMonRef, process, _BookiePid, _Info},
            State=#state{bookie_monref = BookieMonRef}) ->
    ok = pcl_releasesnapshot(State#state.source_penciller, self()),
    {stop, normal, State};
handle_info(_Info, State) ->
    {noreply, State}.

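%% A snapshot has no persisted state of its own to tidy up at termination;
%% only the reason for the shutdown is logged.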
terminate(Reason, _State=#state{is_snapshot=Snap}) when Snap == true ->
|
Develop 3.1 d30update (#386)
* Mas i370 patch d (#383)
* Refactor penciller memory
In high-volume tests on large key-count clusters, so significant variation in the P0031 time has been seen:
TimeBucket PatchA
a.0ms_to_1ms 18554
b.1ms_to_2ms 51778
c.2ms_to_3ms 696
d.3ms_to_5ms 220
e.5ms_to_8ms 59
f.8ms_to_13ms 40
g.13ms_to_21ms 364
h.21ms_to_34ms 277
i.34ms_to_55ms 34
j.55ms_to_89ms 17
k.89ms_to_144ms 21
l.144ms_to_233ms 31
m.233ms_to_377ms 45
n.377ms_to_610ms 52
o.610ms_to_987ms 59
p.987ms_to_1597ms 55
q.1597ms_to_2684ms 54
r.2684ms_to_4281ms 29
s.4281ms_to_6965ms 7
t.6295ms_to_11246ms 1
It is unclear why this varies so much. The time to add to the cache appears to be minimal (but perhaps there is an issue with timing points in the code), whereas the time to add to the index is much more significant and variable. There is also variable time when the memory is rolled (although the actual activity here appears to be minimal.
The refactoring here is two-fold:
- tidy and simplify by keeping LoopState managed within handle_call, and add more helpful dialyzer specs;
- change the update to the index to be a simple extension of a list, rather than any conversion.
This alternative version of the pmem index in unit test is orders of magnitude faster to add - and is the same order of magnitude to check. Anticipation is that it may be more efficient in terms of memory changes.
* Compress SST index
Reduces the size of the leveled_sst index with two changes:
1 - Where there is a common prefix of tuple elements (e.g. Bucket) across the whole leveled_sst file - only the non-common part is indexed, and a function is used to compare.
2 - There is less "indexing" of the index i.e. only 1 in 16 keys are passed into the gb_trees part instead of 1 in 4
* Immediate hibernate
Reasons for delay in hibernate were not clear.
Straight after creation the process will not be in receipt of messages (must wait for the manifest to be updated), so better to hibernate now. This also means the log PC023 provides more accurate information.
* Refactor BIC
This patch avoids the following:
- repeated replacement of the same element in the BIC (via get_kvrange), by checking presence via GET before sing SET
- Stops re-reading of all elements to discover high modified date
Also there appears to have been a bug where a missing HMD for the file is required to add to the cache. However, now the cache may be erased without erasing the HMD. This means that the cache can never be rebuilt
* Use correct size in test results
erts_debug:flat_size/1 returns size in words (i.e. 8 bytes on 64-bit CPU) not bytes
* Don't change summary record
As it is persisted as part of the file write, any change to the summary record cannot be rolled back
* Clerk to prompt L0 write
Simplifies the logic if the clerk request work for the penciller prompts L0 writes as well as Manifest changes.
The advantage now is that if the penciller memory is full, and PUT load stops, the clerk should still be able to prompt persistence. the penciller can therefore make use of dead time this way
Develop 3.1 d30update (#386)
* Mas i370 patch d (#383)
* Refactor penciller memory
In high-volume tests on large key-count clusters, some significant variation in the P0031 time has been seen:
TimeBucket PatchA
a.0ms_to_1ms 18554
b.1ms_to_2ms 51778
c.2ms_to_3ms 696
d.3ms_to_5ms 220
e.5ms_to_8ms 59
f.8ms_to_13ms 40
g.13ms_to_21ms 364
h.21ms_to_34ms 277
i.34ms_to_55ms 34
j.55ms_to_89ms 17
k.89ms_to_144ms 21
l.144ms_to_233ms 31
m.233ms_to_377ms 45
n.377ms_to_610ms 52
o.610ms_to_987ms 59
p.987ms_to_1597ms 55
q.1597ms_to_2684ms 54
r.2684ms_to_4281ms 29
s.4281ms_to_6965ms 7
t.6965ms_to_11246ms 1
It is unclear why this varies so much. The time to add to the cache appears to be minimal (but perhaps there is an issue with timing points in the code), whereas the time to add to the index is much more significant and variable. There is also variable time when the memory is rolled (although the actual activity here appears to be minimal).
The refactoring here is two-fold:
- tidy and simplify by keeping LoopState managed within handle_call, and add more helpful dialyzer specs;
- change the update to the index to be a simple extension of a list, rather than any conversion.
This alternative version of the pmem index is, in unit tests, orders of magnitude faster to add to - and the same order of magnitude to check. The anticipation is that it may also be more efficient in terms of memory changes.
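As a rough sketch of the list-extension approach (function and variable names here are illustrative, not the actual leveled_pmem API):

    %% Extending a list-based index is a single append, with no conversion
    %% of the accumulated structure on each push of new keys.
    add_to_index(NewHashes, ExistingIndex) when is_list(ExistingIndex) ->
        NewHashes ++ ExistingIndex.

    %% Checking is a simple scan - the same order of magnitude as before.
    check_index(Hash, Index) ->
        lists:member(Hash, Index).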
* Compress SST index
Reduces the size of the leveled_sst index with two changes:
1 - Where there is a common prefix of tuple elements (e.g. Bucket) across the whole leveled_sst file - only the non-common part is indexed, and a function is used to compare.
2 - There is less "indexing" of the index, i.e. only 1 in 16 keys are passed into the gb_trees part instead of 1 in 4
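For illustration only, a sparse index in this style might be built as below (a sketch; the real leveled_sst index construction differs in detail):

    -define(INDEX_MODULUS, 16).

    %% Keep only every 16th key in the gb_tree; a lookup then finds the
    %% nearest indexed key and scans the short range in between.
    sparse_index(KVList) ->
        {Tree, _Count} =
            lists:foldl(
                fun({K, Pos}, {TreeAcc, N}) ->
                    case N rem ?INDEX_MODULUS of
                        0 -> {gb_trees:insert(K, Pos, TreeAcc), N + 1};
                        _ -> {TreeAcc, N + 1}
                    end
                end,
                {gb_trees:empty(), 0},
                KVList),
        Tree.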
* Immediate hibernate
Reasons for delay in hibernate were not clear.
Straight after creation the process will not be in receipt of messages (it must wait for the manifest to be updated), so it is better to hibernate now. This also means the log PC023 provides more accurate information.
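In gen_server terms this is just the optional hibernate element of the return tuple, e.g. (a sketch, with a hypothetical helper):

    init([Opts]) ->
        State = build_initial_state(Opts),  % hypothetical helper
        %% Hibernate immediately - no messages are expected until the
        %% manifest has been updated to include this process.
        {ok, State, hibernate}.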
* Refactor BIC
This patch avoids the following:
- repeated replacement of the same element in the BIC (via get_kvrange), by checking presence via GET before using SET
- Stops re-reading of all elements to discover high modified date
Also, there appears to have been a bug whereby a missing HMD for the file was required before an entry could be added to the cache. However, now the cache may be erased without erasing the HMD - which would mean the cache could never be rebuilt.
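A minimal sketch of the GET-before-SET check, using the standard array module (the actual BIC structure in leveled_sst may differ):

    %% Only write the block into the cache slot if the slot is still empty
    %% (the array default, undefined), avoiding repeated replacement of the
    %% same element.
    maybe_cache_block(SlotID, Block, BIC) ->
        case array:get(SlotID, BIC) of
            undefined -> array:set(SlotID, Block, BIC);
            _AlreadyCached -> BIC
        end.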
* Use correct size in test results
erts_debug:flat_size/1 returns the size in words (i.e. 8 bytes per word on a 64-bit CPU), not bytes.
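So a conversion to bytes needs a multiplication by the emulator word size:

    %% Size of a term in bytes = words * bytes-per-word on this emulator.
    term_size_in_bytes(Term) ->
        erts_debug:flat_size(Term) * erlang:system_info(wordsize).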
* Don't change summary record
As it is persisted as part of the file write, any change to the summary record cannot be rolled back
* Clerk to prompt L0 write
The logic is simplified if the clerk's request for work from the penciller prompts L0 writes as well as manifest changes.
The advantage now is that if the penciller memory is full, and PUT load stops, the clerk should still be able to prompt persistence. The penciller can therefore make use of dead time this way.
* Add push on journal compact
If there has been a backlog, followed by a quiet period, there may be a large ledger cache left unpushed. Journal compaction events occur about once per hour, so the performance overhead of a false push should be minimal, with the advantage of clearing any backlog before load starts again.
This is only relevant to Riak users with very off/full, batch-type workloads.
* Extend tests
To more consistently trigger all overload scenarios
* Fix range keys smaller than prefix
The end key cannot be an empty binary in this case: although the intended end of the range is bigger than any key within the range, an empty binary will appear to be smaller than all of them.
Unit tests and ct tests added to expose the potential issue
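The underlying ordering issue is simple to demonstrate - in Erlang term order an empty binary sorts before every non-empty binary, so it cannot stand in for an open-ended top of range:

    empty_binary_sorts_first() ->
        true = <<>> < <<0>>,
        true = <<>> < <<"any key">>,
        ok.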
* Tidy-up
- Remove penciller logs which are no longer called
- Get pclerk to only wait MIN_TIMEOUT after doing work, in case there is a backlog
- Remove update_levelzero_cache function as it is unique to handle_call of push_mem, and simple enough to be inline
- Align the testutil slow offer with the standard slow offer used
* Tidy-up
Remove pre-otp20 references.
Reinstate the check that the starting pid is still active; this was added to tidy up shutdown.
Resolve failure to run on otp20 due to `-if` statement
* Tidy up
Using null rather than {null, Key} is potentially clearer, as it is not a concern what the Key is in this case, and it removes a comparison step from the leveled_codec:endkey_passed/2 function.
There were issues with coverage in eunit tests as the leveled_pclerk shut down. This prompted a general tidy of leveled_pclerk (remove passing of LoopState into internal functions, and add dialyzer specs).
* Remove R16 relic
* Further testing another issue
The StartKey must always be less than or equal to the prefix when the first N characters are stripped, but this is not true of the EndKey (for the query) which does not have to be between the FirstKey and the LastKey.
If the EndKey for the query does not match, it must be greater than the Prefix (as otherwise it would not have been greater than the FirstKey) - so it is set to null.
* Fix unit test
The unit test had a typo - and the result had been misinterpreted.
* Code and spec tidy
Also look to cover, with tests, the situation when the FirstKey is the same as the Prefix.
This is, in theory, not an issue as it is the EndKey for each sublist which is indexed in leveled_tree. However, guard against it mapping to null here, just in case there are dangers lurking (note that tests will still pass without the `M > N` guard in place).
* Hibernate on BIC complete
There are three situations when the BIC becomes complete:
- In a file created as part of a merge, the BIC is learned in the merge
- After startup, files below L1 learn the block cache through reads that happen to read the block; eventually the whole cache will be read, unless...
- Either before or after the cache is complete, it can get wiped by a timeout after a get_sqn request (e.g. as prompted by a journal compaction) ... it will then be re-filled off the back of get/get-range requests.
In all these situations we want to hibernate after the BIC is full - to reflect the fact that the LoopState should now be relatively stable, so it is a good point to GC and rationalise the location of data.
Previously only the first case was covered. Now all three are covered through the bic_complete message.
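A sketch of handling such a message (the record field below is illustrative, not the actual leveled_sst state):

    handle_cast(bic_complete, State) ->
        %% The LoopState should now be relatively stable, so garbage
        %% collect and hibernate, having recorded completion
        %% (the bic_status field is illustrative).
        {noreply, State#state{bic_status = complete}, hibernate}.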
* Test all index keys have same term
This works functionally, but is not optimised (the term is replicated in the index)
* Summaries with same index term
If the keys in the summary index all have the same index term - only the object keys need to be indexed
* Simplify case statements
We either match the pattern <<Prefix:N/binary, Suffix/binary>> or the answer should be null
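Concretely, that style of match collapses to something like the following (a sketch; the actual *_prefix_filter code carries more context):

    %% Return the suffix when Key starts with the N-byte Prefix, null
    %% otherwise. A Key equal to the Prefix (M == N) yields <<>>, which
    %% sorts before every non-empty suffix - see the M == N note below.
    suffix_or_null(Prefix, N, Key) ->
        case Key of
            <<Prefix:N/binary, Suffix/binary>> -> Suffix;
            _ -> null
        end.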
* OK for M == N
If M == N for the first key, it will have a suffix of <<>>. This will match (as expected) a query Start Key of the same size, and be smaller than any query Start Key that has the same prefix.
If the query Start Key does not match the prefix it will be null - as it must be smaller than the Prefix (as otherwise the query Start Key would be bigger than the Last Key).
The constraint of M > N was introduced before the *_prefix_filter functions were checking the prefix, to avoid issues. Now that the prefix is being checked, M == N is ok.
* Simplify
Correct the test to use a binary field in the range.
To avoid further issue, only apply filter when everything is a binary() type.
* Add test for head_only mode
When leveled is used as a tictacaae key store (in parallel mode), the keys will be head_only entries. Double-check they are handled as expected, like object keys.
* Revert previous change - must support typed buckets
Add assertion to confirm worthwhile optimisation
* Add support for configurable cache multiple (#375)
* Mas i370 patch e (#385)
Improvement to monitoring for efficiency and improved readability of logs and stats.
As part of this, where possible, tried to avoid updating loop state on READ messages in leveled processes (as was the case when tracking stats within each process).
No performance benefits found with change, but improved stats has helped discover other potential gains.

    leveled_log:log(p0007, [Reason]);
terminate(Reason, _State) ->
    leveled_log:log(p0011, [Reason]).

format_status(normal, [_PDict, State]) ->
    State;
format_status(terminate, [_PDict, State]) ->
    State#state{manifest = redacted,
                levelzero_cache = redacted,
                levelzero_index = redacted,
                levelzero_astree = redacted}.

code_change(_OldVsn, State, _Extra) ->
    {ok, State}.

%%%============================================================================
%%% Path functions
%%%============================================================================

sst_rootpath(RootPath) ->
    FP = RootPath ++ "/" ++ ?FILES_FP,
    filelib:ensure_dir(FP ++ "/"),
    FP.

sst_filename(ManSQN, Level, Count) ->
    lists:flatten(
        io_lib:format("./~w_~w_~w" ++ ?SST_FILEX, [ManSQN, Level, Count])).
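%% For illustration: assuming ?SST_FILEX is the SST file extension (its
%% definition is not shown here), sst_filename(2, 0, 0) would produce
%% "./2_0_0" ++ ?SST_FILEX - the name encodes ManifestSQN_Level_Count.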

%%%============================================================================
%%% Internal functions
%%%============================================================================

-spec update_clerk(pid()|undefined, fun((pid(), term()) -> ok), term()) -> ok.
update_clerk(undefined, _F, _T) ->
    ok;
update_clerk(Clerk, F, T) when is_pid(Clerk) ->
    F(Clerk, T).
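%% Usage sketch (the fun below is hypothetical, purely for illustration):
%%   update_clerk(Clerk, fun(Pid, Msg) -> gen_server:cast(Pid, Msg) end, prompt)
%% i.e. F(Clerk, T) is applied only when a clerk pid is present.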

-spec start_from_file(penciller_options()) -> {ok, pcl_state()}.
%% @doc
%% Normal start of a penciller (i.e. not a snapshot), needs to read the
%% filesystem and reconstruct the ledger from the files that it finds
start_from_file(PCLopts) ->
    RootPath = PCLopts#penciller_options.root_path,
    MaxTableSize = PCLopts#penciller_options.max_inmemory_tablesize,
    OptsSST = PCLopts#penciller_options.sst_options,
    Monitor = PCLopts#penciller_options.monitor,
    SnapTimeoutShort = PCLopts#penciller_options.snaptimeout_short,
    SnapTimeoutLong = PCLopts#penciller_options.snaptimeout_long,

    {ok, MergeClerk} = leveled_pclerk:clerk_new(self(), RootPath, OptsSST),

    CoinToss = PCLopts#penciller_options.levelzero_cointoss,
    % Used to randomly defer the writing of L0 file.  Intended to help with
    % vnode synchronisation issues (e.g. stop them all by default merging to
    % level zero concurrently)

    InitState = #state{clerk = MergeClerk,
                        root_path = RootPath,
                        levelzero_maxcachesize = MaxTableSize,
                        levelzero_cointoss = CoinToss,
                        levelzero_index = [],
                        snaptimeout_short = SnapTimeoutShort,
                        snaptimeout_long = SnapTimeoutLong,
                        sst_options = OptsSST,
                        monitor = Monitor},

    %% Open manifest
    Manifest0 = leveled_pmanifest:open_manifest(RootPath),
    OpenFun =
        fun(FN, Level) ->
            {ok, Pid, {_FK, _LK}, Bloom} =
                leveled_sst:sst_open(sst_rootpath(RootPath),
                                     FN, OptsSST, Level),
            {Pid, Bloom}
        end,
    SQNFun = fun leveled_sst:sst_getmaxsequencenumber/1,
    {MaxSQN, Manifest1, FileList} =
        leveled_pmanifest:load_manifest(Manifest0, OpenFun, SQNFun),
    leveled_log:log(p0014, [MaxSQN]),
    ManSQN = leveled_pmanifest:get_manifest_sqn(Manifest1),
    leveled_log:log(p0035, [ManSQN]),
    %% Find any L0 files
    L0FN = sst_filename(ManSQN + 1, 0, 0),
    {State0, FileList0} =
        case filelib:is_file(filename:join(sst_rootpath(RootPath), L0FN)) of
            true ->
|
                leveled_log:log(p0015, [L0FN]),
                L0Open = leveled_sst:sst_open(sst_rootpath(RootPath),
                                              L0FN,
                                              OptsSST,
                                              0),
                {ok, L0Pid, {L0StartKey, L0EndKey}, Bloom} = L0Open,
                L0SQN = leveled_sst:sst_getmaxsequencenumber(L0Pid),
                L0Entry = #manifest_entry{start_key = L0StartKey,
                                          end_key = L0EndKey,
                                          filename = L0FN,
                                          owner = L0Pid,
                                          bloom = Bloom},
                Manifest2 =
                    leveled_pmanifest:insert_manifest_entry(Manifest1,
                                                            ManSQN + 1,
                                                            0,
                                                            L0Entry),
                leveled_log:log(p0016, [L0SQN]),
                LedgerSQN = max(MaxSQN, L0SQN),
                {InitState#state{manifest = Manifest2,
                                 ledger_sqn = LedgerSQN,
                                 persisted_sqn = LedgerSQN},
                 [L0FN|FileList]};
            false ->
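                %% No persisted L0 file found - start from the recovered
                %% manifest alone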
                leveled_log:log(p0017, []),
                {InitState#state{manifest = Manifest1,
                                 ledger_sqn = MaxSQN,
                                 persisted_sqn = MaxSQN},
                 FileList}
        end,
    ok = archive_files(RootPath, FileList0),
    {ok, State0}.

-spec shutdown_manifest(leveled_pmanifest:manifest(), pid()|undefined) -> ok.
%% @doc
%% Shutdown all the SST files within the manifest, as well as any L0
%% constructor process.  Manifest entries may be passed to the close fun
%% as #manifest_entry{} records, as {StartKey, Entry} tuples, or as a
%% bare pid, and each form is resolved to an owner before closing.
shutdown_manifest(Manifest, L0Constructor) ->
    EntryCloseFun =
        fun(ME) ->
            Owner =
                case is_record(ME, manifest_entry) of
                    true ->
                        ME#manifest_entry.owner;
                    false ->
                        case ME of
                            {_SK, ME0} ->
                                ME0#manifest_entry.owner;
                            ME ->
                                ME
                        end
                end,
            ok =
                case check_alive(Owner) of
                    true ->
                        leveled_sst:sst_close(Owner);
                    false ->
                        ok
                end
        end,
    leveled_pmanifest:close_manifest(Manifest, EntryCloseFun),
    EntryCloseFun(L0Constructor).
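
%% A sketch of expected use at shutdown (the caller shown is hypothetical):
%%   ok = shutdown_manifest(State#state.manifest, L0Constructor),
%% where L0Constructor may be undefined if no L0 file is being constructed.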

-spec check_alive(pid()|undefined) -> boolean().
%% @doc
%% Double-check a process is active before attempting to terminate
check_alive(Owner) when is_pid(Owner) ->
    is_process_alive(Owner);
check_alive(_Owner) ->
    false.

-spec archive_files(list(), list()) -> ok.
%% @doc
%% Archive any sst files in the folder that have not been used to build the
%% ledger at startup.  They may not have been deleted as expected, so this
%% saves them off as non-SST files to make it easier for an admin to garbage
%% collect these files
archive_files(RootPath, UsedFileList) ->
    {ok, AllFiles} = file:list_dir(sst_rootpath(RootPath)),
    FileCheckFun =
        fun(FN, UnusedFiles) ->
            FN0 = "./" ++ FN,
            case filename:extension(FN0) of
                ?SST_FILEX ->
                    case lists:member(FN0, UsedFileList) of
                        true ->
                            UnusedFiles;
                        false ->
                            leveled_log:log(p0040, [FN0]),
                            [FN0|UnusedFiles]
                    end;
                _ ->
                    UnusedFiles
            end
        end,
    RenameFun =
        fun(FN) ->
            AltName = filename:join(sst_rootpath(RootPath),
                                    filename:basename(FN, ?SST_FILEX))
                        ++ ?ARCHIVE_FILEX,
            file:rename(filename:join(sst_rootpath(RootPath), FN),
                        AltName)
        end,
    FilesToArchive = lists:foldl(FileCheckFun, [], AllFiles),
    lists:foreach(RenameFun, FilesToArchive),
    ok.
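
%% For illustration (file name hypothetical): if the folder contains
%% "./3_1_1" ++ ?SST_FILEX but that name is not in UsedFileList, it is
%% logged (p0040) and renamed in place to "3_1_1" ++ ?ARCHIVE_FILEX,
%% keeping the file available for manual inspection or deletion.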

-spec maybe_cache_too_big(
    pos_integer(), pos_integer(), boolean()) -> boolean().
%% @doc
%% Is the cache too big - should it be flushed to on-disk Level 0?
%% There exists some jitter to prevent all caches from flushing concurrently
%% where there are multiple leveled instances on one machine.
maybe_cache_too_big(NewL0Size, L0MaxSize, CoinToss) ->
    CacheTooBig = NewL0Size > L0MaxSize,
    CacheMuchTooBig =
        NewL0Size > min(?SUPER_MAX_TABLE_SIZE, 2 * L0MaxSize),
    RandomFactor =
        case CoinToss of
            true ->
                case leveled_rand:uniform(?COIN_SIDECOUNT) of
                    1 ->
                        true;
                    _ ->
                        false
                end;
            false ->
                true
        end,
    CacheTooBig and (RandomFactor or CacheMuchTooBig).
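
%% A minimal worked sketch (sizes hypothetical) - with jitter disabled the
%% check is a plain threshold test:
%%   maybe_cache_too_big(1000, 2000, false) -> false
%%   maybe_cache_too_big(2001, 2000, false) -> true
%% With CoinToss true, a cache just over L0MaxSize flushes only on a 1 in
%% ?COIN_SIDECOUNT toss, but a cache over the much-too-big bound
%% (2 * L0MaxSize, capped at ?SUPER_MAX_TABLE_SIZE) always flushes.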
2019-01-14 12:27:51 +00:00
|
|
|
|
Develop 3.1 d30update (#386)
* Mas i370 patch d (#383)
* Refactor penciller memory
In high-volume tests on large key-count clusters, so significant variation in the P0031 time has been seen:
TimeBucket PatchA
a.0ms_to_1ms 18554
b.1ms_to_2ms 51778
c.2ms_to_3ms 696
d.3ms_to_5ms 220
e.5ms_to_8ms 59
f.8ms_to_13ms 40
g.13ms_to_21ms 364
h.21ms_to_34ms 277
i.34ms_to_55ms 34
j.55ms_to_89ms 17
k.89ms_to_144ms 21
l.144ms_to_233ms 31
m.233ms_to_377ms 45
n.377ms_to_610ms 52
o.610ms_to_987ms 59
p.987ms_to_1597ms 55
q.1597ms_to_2684ms 54
r.2684ms_to_4281ms 29
s.4281ms_to_6965ms 7
t.6295ms_to_11246ms 1
It is unclear why this varies so much. The time to add to the cache appears to be minimal (but perhaps there is an issue with timing points in the code), whereas the time to add to the index is much more significant and variable. There is also variable time when the memory is rolled (although the actual activity here appears to be minimal.
The refactoring here is two-fold:
- tidy and simplify by keeping LoopState managed within handle_call, and add more helpful dialyzer specs;
- change the update to the index to be a simple extension of a list, rather than any conversion.
This alternative version of the pmem index in unit test is orders of magnitude faster to add - and is the same order of magnitude to check. Anticipation is that it may be more efficient in terms of memory changes.
* Compress SST index
Reduces the size of the leveled_sst index with two changes:
1 - Where there is a common prefix of tuple elements (e.g. Bucket) across the whole leveled_sst file - only the non-common part is indexed, and a function is used to compare.
2 - There is less "indexing" of the index i.e. only 1 in 16 keys are passed into the gb_trees part instead of 1 in 4
* Immediate hibernate
Reasons for delay in hibernate were not clear.
Straight after creation the process will not be in receipt of messages (must wait for the manifest to be updated), so better to hibernate now. This also means the log PC023 provides more accurate information.
* Refactor BIC
This patch avoids the following:
- repeated replacement of the same element in the BIC (via get_kvrange), by checking presence via GET before sing SET
- Stops re-reading of all elements to discover high modified date
Also there appears to have been a bug where a missing HMD for the file is required to add to the cache. However, now the cache may be erased without erasing the HMD. This means that the cache can never be rebuilt
* Use correct size in test results
erts_debug:flat_size/1 returns size in words (i.e. 8 bytes on 64-bit CPU) not bytes
* Don't change summary record
As it is persisted as part of the file write, any change to the summary record cannot be rolled back
* Clerk to prompt L0 write
Simplifies the logic if the clerk request work for the penciller prompts L0 writes as well as Manifest changes.
The advantage now is that if the penciller memory is full, and PUT load stops, the clerk should still be able to prompt persistence. the penciller can therefore make use of dead time this way
* Add push on journal compact
If there has been a backlog, followed by a quiet period - there may be a large ledger cache left unpushed. Journal compaction events are about once per hour, so the performance overhead of a false push should be minimal, with the advantage of clearing any backlog before load starts again.
This is only relevant to riak users with very off/full batch type workloads.
* Extend tests
To more consistently trigger all overload scenarios
* Fix range keys smaller than prefix
Can't make end key an empty binary in this case, as it may be bigger than any keys within the range, but will appear to be smaller.
Unit tests and ct tests added to expose the potential issue
* Tidy-up
- Remove penciller logs which are no longer called
- Get pclerk to only wait MIN_TIMEOUT after doing work, in case there is a backlog
- Remove update_levelzero_cache function as it is unique to handle_call of push_mem, and simple enough to be inline
- Alight testutil slow offer with standard slow offer used
* Tidy-up
Remove pre-otp20 references.
Reinstate the check that the starting pid is still active, this was added to tidy up shutdown.
Resolve failure to run on otp20 due to `-if` sttaement
* Tidy up
Using null rather than {null, Key} is potentially clearer, as it is not a concern what the Key is in this case, and it removes a comparison step from the leveled_codec:endkey_passed/2 function.
There were issues with coverage in eunit tests as the leveled_pclerk shut down. This prompted a general tidy of leveled_pclerk (remove passing of LoopState into internal functions, and add dialyzer specs).
* Remove R16 relic
* Further testing another issue
The StartKey must always be less than or equal to the prefix when the first N characters are stripped, but this is not true of the EndKey (for the query) which does not have to be between the FirstKey and the LastKey.
If the EndKey query does not match it must be greater than the Prefix (as otherwise it would not have been greater than the FirstKey) - so it is set to null.
* Fix unit test
Unit test had a typo - and result interpretation had a misunderstanding.
* Code and spec tidy
Also look to cover the situation when the FirstKey is the same as the Prefix with tests.
This is, in theory, not an issue as it is the EndKey for each sublist which is indexed in leveled_tree. However, guard against it mapping to null here, just in case there are dangers lurking (note that tests will still pass without the `M > N` guard in place).
* Hibernate on BIC complete
There are three situations when the BIC becomes complete:
- In a file created as part of a merge, the BIC is learned in the merge
- After startup, files below L1 learn the block cache through reads that happen to read the block; eventually the whole cache will be read, unless...
- Either before/after the cache is complete, it can get wiped by a timeout after a get_sqn request (e.g. as prompted by a journal compaction) ... it will then be re-filled off the back of get/get-range requests.
In all these situations we want to hibernate after the BIC is full - to reflect the fact that the LoopState should now be relatively stable, so it is a good point to GC and rationalise the location of data.
Previously only the first case was covered. Now all three are covered through the bic_complete message.
* Test all index keys have same term
This works functionally, but is not optimised (the term is replicated in the index)
* Summaries with same index term
If the summary index entries all have the same index term - only the object keys need to be indexed
* Simplify case statements
We either match the pattern of <<Prefix:N, Suffix>> or the answer should be null
* OK for M == N
If M == N for the first key, it will have a suffix of <<>>. This will match (as expected) a query Start Key of the same size, and be smaller than any query Start Key that has the same prefix.
If the query Start Key does not match the prefix - it will be null - as it must be smaller than the Prefix (as otherwise the query Start Key would be bigger than the Last Key).
The constraint of M > N was introduced before the *_prefix_filter functions were checking the prefix, to avoid issues. Now that the prefix is being checked, M == N is ok.
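A minimal sketch of that match-or-null shape (a hypothetical helper, not the module's actual code):

    %% Either the Key starts with the N-byte Prefix - in which case the
    %% Suffix (possibly <<>> when M == N) is returned - or the result is null.
    strip_prefix(Prefix, Key) when is_binary(Prefix), is_binary(Key) ->
        N = byte_size(Prefix),
        case Key of
            <<Prefix:N/binary, Suffix/binary>> -> Suffix;
            _ -> null
        end.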
* Simplify
Correct the test to use a binary field in the range.
To avoid further issue, only apply filter when everything is a binary() type.
* Add test for head_only mode
When leveled is used as a tictacaae key store (in parallel mode), the keys will be head_only entries. Double-check that they are handled as expected, like object keys.
* Revert previous change - must support typed buckets
Add assertion to confirm worthwhile optimisation
* Add support for configurable cache multiple (#375)
* Mas i370 patch e (#385)
Improvement to monitoring for efficiency and improved readability of logs and stats.
As part of this, where possible, we tried to avoid updating loop state on READ messages in leveled processes (as was the case when tracking stats within each process).
No performance benefit was found with the change, but the improved stats have helped discover other potential gains.
-spec roll_memory(
    pos_integer(), non_neg_integer(), string(),
    levelzero_cache()|none, pos_integer(),
    sst_options(), boolean())
        -> {pid(), leveled_ebloom:bloom()|none}.
%% @doc
%% Roll the in-memory cache into a L0 file. If this is done synchronously,
%% this will return a bloom representing the contents of the file.
%%
%% Casting a large object (the levelzero cache) to the SST file does not lead
%% to an immediate return. With 32K keys in the TreeList it could take around
%% 35-40ms due to the overheads of copying.
%%
%% To avoid blocking the penciller, the SST file can request each item of the
%% cache one at a time.
%%
%% The Wait is set to false to use a cast when calling this in normal
%% operation, whereas a Wait of true is used at shutdown.
roll_memory(NextManSQN, LedgerSQN, RootPath, none, CL, SSTOpts, false) ->
    L0Path = sst_rootpath(RootPath),
    L0FN = sst_filename(NextManSQN, 0, 0),
    leveled_log:log(p0019, [L0FN, LedgerSQN]),
    PCL = self(),
    FetchFun =
        fun(Slot, ReturnFun) -> pcl_fetchlevelzero(PCL, Slot, ReturnFun) end,
    {ok, Constructor, _} =
        leveled_sst:sst_newlevelzero(
            L0Path, L0FN, CL, FetchFun, PCL, LedgerSQN, SSTOpts),
    {Constructor, none};
roll_memory(NextManSQN, LedgerSQN, RootPath, L0Cache, CL, SSTOpts, true) ->
    L0Path = sst_rootpath(RootPath),
    L0FN = sst_filename(NextManSQN, 0, 0),
    FetchFun = fun(Slot) -> lists:nth(Slot, L0Cache) end,
    KVList = leveled_pmem:to_list(CL, FetchFun),
    {ok, Constructor, _, Bloom} =
        leveled_sst:sst_new(
            L0Path, L0FN, 0, KVList, LedgerSQN, SSTOpts),
    {Constructor, Bloom}.
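%% A hedged usage sketch (argument values illustrative only). In normal
%% operation the roll is asynchronous - the SST constructor fetches the L0
%% cache slots one at a time, and no bloom is returned at this point:
%%   {L0Pid, none} =
%%       roll_memory(ManSQN, SQN, RootPath, none, CacheSize, SSTOpts, false)
%% At shutdown the roll is synchronous, and the bloom is returned alongside
%% the constructor:
%%   {L0Pid, Bloom} =
%%       roll_memory(ManSQN, SQN, RootPath, L0Cache, CacheSize, SSTOpts, true)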
-spec timed_fetch_mem(
    tuple(),
    {integer(), integer()},
    leveled_pmanifest:manifest(), list(),
    leveled_pmem:index_array(),
    leveled_monitor:monitor()) -> leveled_codec:ledger_kv()|not_found.
%% @doc
%% Fetch the result from the penciller, starting by looking in the memory,
%% and if it is not found looking down level by level through the LSM tree.
%%
%% This allows for the request to be timed, and the timing result to be added
%% to the aggregate timings - so that timings per level can be logged and
%% the cost of requests dropping levels can be monitored.
%%
%% The result tuple includes the level at which the result was found.
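%%
%% As an illustrative (not actual) shape: a fetch satisfied at Level 2 might
%% be timed as {2, TimeMicros}, with the monitor aggregating such observations
%% per level so the distribution of fetch cost by level can be logged.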
|
timed_fetch_mem(Key, Hash, Manifest, L0Cache, L0Index, Monitor) ->
    SW0 = leveled_monitor:maybe_time(Monitor),
    {R, Level} =
        fetch_mem(Key, Hash, Manifest, L0Cache, L0Index, fun timed_sst_get/4),
    {TS0, _SW1} = leveled_monitor:step_time(SW0),
    maybelog_fetch_timing(Monitor, Level, TS0, R == not_present),
    R.

-spec fetch_sqn(
    leveled_codec:ledger_key(),
    leveled_codec:segment_hash(),
    leveled_pmanifest:manifest(),
    list(),
    leveled_pmem:index_array()) ->
        not_present|leveled_codec:ledger_kv()|leveled_codec:sqn().
%% @doc
%% Fetch the result from the penciller, starting by looking in the memory,
%% and if it is not found looking down level by level through the LSM tree.
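%% This fetch is not timed, as SQN checks are background work (e.g. prompted
%% by journal compaction) rather than user requests.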
fetch_sqn(Key, Hash, Manifest, L0Cache, L0Index) ->
    R = fetch_mem(Key, Hash, Manifest, L0Cache, L0Index, fun sst_getsqn/4),
    element(1, R).

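%% @doc
%% Check the levelzero cache for the Key. Where an L0 index is present it is
%% used to reduce the set of cache positions to be checked; otherwise every
%% position in the L0 cache must be checked. If the Key is not found in
%% memory, the persisted levels of the LSM tree are then checked in turn.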
fetch_mem(Key, Hash, Manifest, L0Cache, L0Index, FetchFun) ->
    PosList =
        case L0Index of
            none ->
                lists:seq(1, length(L0Cache));
            _ ->
                leveled_pmem:check_index(Hash, L0Index)
        end,
    L0Check = leveled_pmem:check_levelzero(Key, Hash, PosList, L0Cache),
    case L0Check of
        {false, not_found} ->
            fetch(Key, Hash, Manifest, 0, FetchFun);
        {true, KV} ->
            {KV, memory}
    end.

-spec fetch(tuple(), {integer(), integer()},
            leveled_pmanifest:manifest(), integer(),
            sst_fetchfun()) -> {tuple()|not_present, integer()|basement}.
%% @doc
%% Fetch from the persisted portion of the LSM tree, checking each level in
%% turn until a match is found.
%% Levels can be skipped by checking the bloom for the relevant file at that
%% level.
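%% A negative bloom check proves the Key cannot be present in the file, so
%% the level can be skipped without messaging the SST process. A positive
%% check may be a false positive, in which case the FetchFun will return
%% not_present and the search continues at the next level down.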
fetch(_Key, _Hash, _Manifest, ?MAX_LEVELS + 1, _FetchFun) ->
    {not_present, basement};
fetch(Key, Hash, Manifest, Level, FetchFun) ->
    case leveled_pmanifest:key_lookup(Manifest, Level, Key) of
        false ->
            fetch(Key, Hash, Manifest, Level + 1, FetchFun);
        FP ->
            case leveled_pmanifest:check_bloom(Manifest, FP, Hash) of
                true ->
                    case FetchFun(FP, Key, Hash, Level) of
                        not_present ->
                            fetch(Key, Hash, Manifest, Level + 1, FetchFun);
                        ObjectFound ->
                            {ObjectFound, Level}
                    end;
                false ->
                    fetch(Key, Hash, Manifest, Level + 1, FetchFun)
            end
    end.

timed_sst_get(PID, Key, Hash, Level) ->
    SW = os:timestamp(),
    R = leveled_sst:sst_get(PID, Key, Hash),
    T0 = timer:now_diff(os:timestamp(), SW),
    log_slowfetch(T0, R, PID, Level, ?SLOW_FETCH).

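%% The Level is ignored here - SQN checks are background processes (e.g.
%% journal compaction checks) and so are not timed or logged by level.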
sst_getsqn(PID, Key, Hash, _Level) ->
    leveled_sst:sst_getsqn(PID, Key, Hash).

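%% T0 is in microseconds (as returned by timer:now_diff/2), and so the
%% FetchTolerance (normally ?SLOW_FETCH) is a microsecond threshold.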
log_slowfetch(T0, R, PID, Level, FetchTolerance) ->
    case {T0, R} of
        {T, R} when T < FetchTolerance ->
            R;
        {T, not_present} ->
            leveled_log:log(pc016, [PID, T, Level, not_present]),
            not_present;
        {T, R} ->
            leveled_log:log(pc016, [PID, T, Level, found]),
            R
    end.

-spec compare_to_sqn(
    leveled_codec:ledger_kv()|leveled_codec:sqn()|not_present,
    integer()) -> sqn_check().
%% @doc
%% Check to see if the SQN in the penciller is after the SQN expected for an
%% object (used to allow the journal to check compaction status from a cache
%% of the ledger - objects with a more recent sequence number can be
%% compacted).
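%%
%% For example, if the Journal expects SQN 100 for a Key: a Ledger SQN of 101
%% returns replaced (the object may be compacted), a Ledger SQN of 100
%% returns current, and not_present returns missing.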
compare_to_sqn(not_present, _SQN) ->
    missing;
compare_to_sqn(ObjSQN, SQN) when is_integer(ObjSQN), ObjSQN > SQN ->
    replaced;
compare_to_sqn(ObjSQN, _SQN) when is_integer(ObjSQN) ->
    % Normally we would expect the SQN to be equal here, but
    % this also allows for the Journal to have a more advanced
    % value. We return current here as we wouldn't want to
    % compact that more advanced value, but this may cause
    % confusion in snapshots.
    current;
compare_to_sqn(Obj, SQN) ->
    compare_to_sqn(leveled_codec:strip_to_seqonly(Obj), SQN).
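%% For example (illustrative SQNs only):
%% compare_to_sqn(not_present, 4) -> missing
%% compare_to_sqn(7, 4) -> replaced (the ledger is ahead of the expected SQN)
%% compare_to_sqn(4, 4) -> current
%% compare_to_sqn(3, 4) -> current (the journal may hold a more advanced value)
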
%%%============================================================================
%%% Iterator functions
%%%
%%% TODO - move to dedicated module with extended unit testing
%%%============================================================================

-spec keyfolder(
    {list(), list()},
    {leveled_codec:ledger_key(), leveled_codec:ledger_key()},
    {pclacc_fun(), any(), pos_integer()},
    {boolean(), {non_neg_integer(), pos_integer()|infinity}, integer()})
        -> any().
%% @doc
%% The keyfolder will compare an iterator across the immutable in-memory cache
%% of the Penciller (the IMMiter), with an iterator across the persisted part
%% (the SSTiter).
%%
%% A SegmentList and a MaxKeys may be passed. Every time something is added
%% to the accumulator MaxKeys is reduced - so set MaxKeys to -1 if it is
%% intended to be infinite.
%%
%% The basic principle is to take the next key in the IMMiter and compare it
%% to the next key in the SSTiter, and decide which one should be added to the
%% accumulator. The iterators are advanced if they either win (i.e. are the
%% next key), or are dominated. This goes on until the iterators are empty.
%%
%% To advance the SSTiter the find_nextkey/5 function is used, as the SSTiter
%% is an iterator across multiple levels - and so needs to do its own
%% comparisons to pop the next result.
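%%
%% For example (shapes only, the values are invented): with MaxKeys set to
%% 10, a fold that exhausts both iterators after accumulating 4 keys returns
%% {6, Acc}; with MaxKeys set to -1 the same fold returns just Acc, with no
%% remainder count attached.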
keyfolder(_Iterators,
            _KeyRange,
            {_AccFun, Acc, _Now},
            {_SegmentList, _LastModRange, MaxKeys}) when MaxKeys == 0 ->
    {0, Acc};
keyfolder({[], SSTiter}, KeyRange, {AccFun, Acc, Now},
            {SegmentList, LastModRange, MaxKeys}) ->
    {StartKey, EndKey} = KeyRange,
    case find_nextkey(SSTiter, StartKey, EndKey,
                        SegmentList, element(1, LastModRange)) of
        no_more_keys ->
            case MaxKeys > 0 of
                true ->
                    % This query had a max count, so we must respond with the
                    % remainder on the count
                    {MaxKeys, Acc};
                false ->
                    % This query started with a MaxKeys set to -1. Query is
                    % not interested in having MaxKeys in Response
                    Acc
            end;
        {NxSSTiter, {SSTKey, SSTVal}} ->
            {Acc1, MK1} =
                maybe_accumulate(SSTKey, SSTVal,
                                    {Acc, AccFun, Now},
                                    MaxKeys, LastModRange),
            keyfolder({[], NxSSTiter},
                        KeyRange,
                        {AccFun, Acc1, Now},
                        {SegmentList, LastModRange, MK1})
    end;
keyfolder({[{IMMKey, IMMVal}|NxIMMiterator], SSTiterator},
            KeyRange,
            {AccFun, Acc, Now},
            {SegmentList, LastModRange, MaxKeys}) ->
    {StartKey, EndKey} = KeyRange,
    case {IMMKey < StartKey, leveled_codec:endkey_passed(EndKey, IMMKey)} of
        {false, true} ->
            % There are no more keys in-range in the in-memory
            % iterator, so take action as if this iterator is empty
            % (see above)
            keyfolder({[], SSTiterator},
                        KeyRange,
                        {AccFun, Acc, Now},
                        {SegmentList, LastModRange, MaxKeys});
        {false, false} ->
            case find_nextkey(SSTiterator, StartKey, EndKey,
                                SegmentList, element(1, LastModRange)) of
                no_more_keys ->
                    % No more keys in range in the persisted store, so use the
                    % in-memory KV as the next
                    {Acc1, MK1} =
                        maybe_accumulate(IMMKey, IMMVal,
                                            {Acc, AccFun, Now},
                                            MaxKeys, LastModRange),
                    keyfolder({NxIMMiterator, []},
                                KeyRange,
                                {AccFun, Acc1, Now},
                                {SegmentList, LastModRange, MK1});
                {NxSSTiterator, {SSTKey, SSTVal}} ->
                    % There is a next key, so need to know which is the
                    % next key between the two (and handle two keys
                    % with different sequence numbers).
                    case leveled_codec:key_dominates({IMMKey, IMMVal},
                                                        {SSTKey, SSTVal}) of
                        left_hand_first ->
                            {Acc1, MK1} =
                                maybe_accumulate(IMMKey, IMMVal,
                                                    {Acc, AccFun, Now},
                                                    MaxKeys, LastModRange),
                            % Stow the previous best result away at Level -1
                            % so that there is no need to iterate to it again
                            NewEntry = {-1, [{SSTKey, SSTVal}]},
                            keyfolder({NxIMMiterator,
                                            lists:keystore(-1,
                                                            1,
                                                            NxSSTiterator,
                                                            NewEntry)},
                                        KeyRange,
                                        {AccFun, Acc1, Now},
                                        {SegmentList, LastModRange, MK1});
                        right_hand_first ->
                            {Acc1, MK1} =
                                maybe_accumulate(SSTKey, SSTVal,
                                                    {Acc, AccFun, Now},
                                                    MaxKeys, LastModRange),
                            keyfolder({[{IMMKey, IMMVal}|NxIMMiterator],
                                            NxSSTiterator},
                                        KeyRange,
                                        {AccFun, Acc1, Now},
                                        {SegmentList, LastModRange, MK1});
                        left_hand_dominant ->
                            {Acc1, MK1} =
                                maybe_accumulate(IMMKey, IMMVal,
                                                    {Acc, AccFun, Now},
                                                    MaxKeys, LastModRange),
                            % We can add to the accumulator here, as the SST
                            % key was the most dominant across all SST levels,
                            % so there is no need to hold off until the IMMKey
                            % is left hand first.
                            keyfolder({NxIMMiterator,
                                            NxSSTiterator},
                                        KeyRange,
                                        {AccFun, Acc1, Now},
                                        {SegmentList, LastModRange, MK1})
                    end
            end
    end.

-spec maybe_accumulate(leveled_codec:ledger_key(),
                        leveled_codec:ledger_value(),
                        {any(), pclacc_fun(), pos_integer()},
                        integer(),
                        {non_neg_integer(), non_neg_integer()|infinity})
            -> any().
%% @doc
%% Make an accumulation decision based on the date range
maybe_accumulate(LK, LV,
                    {Acc, AccFun, QueryStartTime},
                    MaxKeys,
                    {LowLastMod, HighLastMod}) ->
    {_SQN, _SH, LMD} = leveled_codec:strip_to_indexdetails({LK, LV}),
    RunAcc =
        (LMD == undefined) or
        ((LMD >= LowLastMod) and (LMD =< HighLastMod)),
    case RunAcc and leveled_codec:is_active(LK, LV, QueryStartTime) of
        true ->
            {AccFun(LK, LV, Acc), MaxKeys - 1};
        false ->
            {Acc, MaxKeys}
    end.
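%% For example (invented values): with a LastModRange of {10, 20}, a key whose
%% last-modified date is 15 is passed to the AccFun and MaxKeys is reduced by
%% one, whereas a key with a last-modified date of 25 leaves both the
%% accumulator and MaxKeys unchanged. A key with an undefined last-modified
%% date is always a candidate for accumulation (subject to the is_active
%% check).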

-spec find_nextkey(
    iterator(),
    leveled_codec:ledger_key(),
    leveled_codec:ledger_key(),
    list(non_neg_integer())|false,
    non_neg_integer())
        -> no_more_keys|{iterator(), leveled_codec:ledger_kv()}.
%% @doc
%% Looks to find the best choice for the next key across the levels (other
%% than the in-memory table).
%% In finding the best choice, the next key in a given level may be a next
%% block or next file pointer which will need to be expanded.
find_nextkey(QueryArray, StartKey, EndKey, SegmentList, LowLastMod) ->
    find_nextkey(QueryArray,
                    -1,
                    {null, null},
                    StartKey, EndKey,
                    SegmentList,
                    LowLastMod,
                    ?ITERATOR_SCANWIDTH).

find_nextkey(_QueryArray, LCnt,
                {null, null},
                _StartKey, _EndKey,
                _SegList, _LowLastMod, _Width) when LCnt > ?MAX_LEVELS ->
    % The array has been scanned without finding a best key - must be
    % exhausted - respond to indicate no more keys to be found by the
    % iterator
    no_more_keys;
find_nextkey(QueryArray, LCnt,
                {BKL, BestKV},
                _StartKey, _EndKey,
                _SegList, _LowLastMod, _Width) when LCnt > ?MAX_LEVELS ->
    % All levels have been scanned, so need to remove the best result from
    % the array, and return that array along with the best key/sqn/status
    % combination
    {BKL, [BestKV|Tail]} = lists:keyfind(BKL, 1, QueryArray),
    {lists:keyreplace(BKL, 1, QueryArray, {BKL, Tail}), BestKV};
find_nextkey(QueryArray, LCnt,
                {BestKeyLevel, BestKV},
                StartKey, EndKey,
                SegList, LowLastMod, Width) ->
    % Get the next key at this level
    {NextKey, RestOfKeys} =
        case lists:keyfind(LCnt, 1, QueryArray) of
            false ->
                {null, null};
            {LCnt, []} ->
                {null, null};
            {LCnt, [NK|ROfKs]} ->
                {NK, ROfKs}
        end,
    % Compare the next key at this level with the best key
    case {NextKey, BestKeyLevel, BestKV} of
        {null, BKL, BKV} ->
            % There is no key at this level - go to the next level
            find_nextkey(QueryArray,
                            LCnt + 1,
                            {BKL, BKV},
                            StartKey, EndKey,
                            SegList, LowLastMod, Width);
        {{next, Owner, _SK}, BKL, BKV} ->
            % The first key at this level is a pointer to a file - need to
            % query the file to expand this level out before proceeding
            Pointer = {next, Owner, StartKey, EndKey},
            UpdList = leveled_sst:sst_expandpointer(Pointer,
                                                    RestOfKeys,
                                                    Width,
                                                    SegList,
                                                    LowLastMod),
            NewEntry = {LCnt, UpdList},
            % Need to loop around at this level (LCnt) as we have not yet
            % examined a real key at this level
            find_nextkey(lists:keyreplace(LCnt, 1, QueryArray, NewEntry),
                            LCnt,
                            {BKL, BKV},
                            StartKey, EndKey,
                            SegList, LowLastMod, Width);
        {{pointer, SSTPid, Slot, PSK, PEK}, BKL, BKV} ->
            % The first key at this level is a pointer within a file - need
            % to query the file to expand this level out before proceeding
            Pointer = {pointer, SSTPid, Slot, PSK, PEK},
            UpdList = leveled_sst:sst_expandpointer(Pointer,
                                                    RestOfKeys,
                                                    Width,
                                                    SegList,
                                                    LowLastMod),
            NewEntry = {LCnt, UpdList},
            % Need to loop around at this level (LCnt) as we have not yet
            % examined a real key at this level
            find_nextkey(lists:keyreplace(LCnt, 1, QueryArray, NewEntry),
                            LCnt,
                            {BKL, BKV},
                            StartKey, EndKey,
                            SegList, LowLastMod, Width);
        {{Key, Val}, null, null} ->
            % No best key set - so can assume that this key is the best key,
            % and check the lower levels
            find_nextkey(QueryArray,
                            LCnt + 1,
                            {LCnt, {Key, Val}},
                            StartKey, EndKey,
                            SegList, LowLastMod, Width);
        {{Key, Val}, _BKL, {BestKey, _BestVal}} when Key < BestKey ->
            % There is a real key and a best key to compare, and the real key
            % at this level is before the best key, and so is now the new best
            % key
            % The QueryArray is not modified until we have checked all levels
            find_nextkey(QueryArray,
                            LCnt + 1,
                            {LCnt, {Key, Val}},
                            StartKey, EndKey,
                            SegList, LowLastMod, Width);
        {{Key, Val}, BKL, {BestKey, BestVal}} when Key == BestKey ->
            SQN = leveled_codec:strip_to_seqonly({Key, Val}),
            BestSQN = leveled_codec:strip_to_seqonly({BestKey, BestVal}),
            if
                SQN =< BestSQN ->
                    % This is a dominated key, so we need to skip over it
                    NewQArray = lists:keyreplace(LCnt,
                                                    1,
                                                    QueryArray,
                                                    {LCnt, RestOfKeys}),
                    find_nextkey(NewQArray,
                                    LCnt + 1,
                                    {BKL, {BestKey, BestVal}},
                                    StartKey, EndKey,
                                    SegList, LowLastMod, Width);
                SQN > BestSQN ->
                    % There is a real key at the front of this level and it has
                    % a higher SQN than the best key, so we should use this as
                    % the best key
                    % But we also need to remove the dominated key from the
                    % lower level in the query array
                    OldBestEntry = lists:keyfind(BKL, 1, QueryArray),
                    {BKL, [{BestKey, BestVal}|BestTail]} = OldBestEntry,
                    find_nextkey(lists:keyreplace(BKL,
                                                    1,
                                                    QueryArray,
                                                    {BKL, BestTail}),
                                    LCnt + 1,
                                    {LCnt, {Key, Val}},
                                    StartKey, EndKey,
                                    SegList, LowLastMod, Width)
            end;
        {_, BKL, BKV} ->
            % This is not the best key
            find_nextkey(QueryArray,
                            LCnt + 1,
                            {BKL, BKV},
                            StartKey, EndKey,
                            SegList, LowLastMod, Width)
    end.
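%% The QueryArray here is a list of {Level, KVList} entries, for example
%% [{2, [KV1, KV5]}, {3, [KV3]}] - see simple_findnextkey_test/0 below for a
%% worked example of repeated calls popping keys out in key order.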

-spec maybelog_fetch_timing(
    leveled_monitor:monitor(),
    memory|leveled_pmanifest:lsm_level(),
    leveled_monitor:timing(),
    boolean()) -> ok.
maybelog_fetch_timing(_Monitor, _Level, no_timing, _NF) ->
    ok;
maybelog_fetch_timing({Pid, _StatsFreq}, _Level, FetchTime, true) ->
    leveled_monitor:add_stat(Pid, {pcl_fetch_update, not_found, FetchTime});
maybelog_fetch_timing({Pid, _StatsFreq}, Level, FetchTime, _NF) ->
    leveled_monitor:add_stat(Pid, {pcl_fetch_update, Level, FetchTime}).

%%%============================================================================
%%% Test
%%%============================================================================

-ifdef(TEST).

-include_lib("eunit/include/eunit.hrl").

-spec pcl_fetch(
    pid(), leveled_codec:ledger_key())
        -> leveled_codec:ledger_kv()|not_present.
pcl_fetch(Pid, Key) ->
    Hash = leveled_codec:segment_hash(Key),
    if
        Hash /= no_lookup ->
            gen_server:call(Pid, {fetch, Key, Hash, true}, infinity)
    end.

keyfolder(IMMiter, SSTiter, StartKey, EndKey, {AccFun, Acc, Now}) ->
    keyfolder({IMMiter, SSTiter},
                {StartKey, EndKey},
                {AccFun, Acc, Now},
                {false, {0, infinity}, -1}).

find_nextkey(QueryArray, StartKey, EndKey) ->
    find_nextkey(QueryArray, StartKey, EndKey, false, 0).
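
%% A minimal sketch of find_nextkey/3 behaviour (keys and SQNs invented for
%% illustration): when the same key is present at two levels, the version
%% with the higher SQN - held at the higher (lower-numbered) level - should
%% be returned, with the dominated version skipped.
samekey_findnextkey_test() ->
    QueryArray =
        [{2, [{{o, "Bucket1", "Key1", null},
                    {5, {active, infinity}, {0, 0}, null}}]},
            {3, [{{o, "Bucket1", "Key1", null},
                    {3, {active, infinity}, {0, 0}, null}}]}],
    {_Array2, KV} = find_nextkey(QueryArray,
                                    {o, "Bucket1", "Key0", null},
                                    {o, "Bucket1", "Key9", null}),
    ?assertMatch({{o, "Bucket1", "Key1", null},
                    {5, {active, infinity}, {0, 0}, null}},
                    KV).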

generate_randomkeys({Count, StartSQN}) ->
    generate_randomkeys(Count, StartSQN, []).

generate_randomkeys(0, _SQN, Acc) ->
    lists:reverse(Acc);
generate_randomkeys(Count, SQN, Acc) ->
    K = {o,
            lists:concat(["Bucket", leveled_rand:uniform(1024)]),
            lists:concat(["Key", leveled_rand:uniform(1024)]),
            null},
    RandKey = {K,
                {SQN,
                    {active, infinity},
                    leveled_codec:segment_hash(K),
                    null}},
    generate_randomkeys(Count - 1, SQN + 1, [RandKey|Acc]).

clean_testdir(RootPath) ->
    clean_subdir(sst_rootpath(RootPath)),
    clean_subdir(filename:join(RootPath, ?MANIFEST_FP)).

clean_subdir(DirPath) ->
    case filelib:is_dir(DirPath) of
        true ->
            {ok, Files} = file:list_dir(DirPath),
            lists:foreach(fun(FN) ->
                                File = filename:join(DirPath, FN),
                                ok = file:delete(File),
                                io:format("Success deleting ~s~n", [File])
                            end,
                            Files);
        false ->
            ok
    end.

maybe_pause_push(PCL, KL) ->
    T0 = [],
    I0 = leveled_pmem:new_index(),
    T1 = lists:foldl(fun({K, V}, {AccSL, AccIdx, MinSQN, MaxSQN}) ->
                            UpdSL = [{K, V}|AccSL],
                            SQN = leveled_codec:strip_to_seqonly({K, V}),
                            H = leveled_codec:segment_hash(K),
                            UpdIdx = leveled_pmem:prepare_for_index(AccIdx, H),
                            {UpdSL, UpdIdx, min(SQN, MinSQN), max(SQN, MaxSQN)}
                        end,
                        {T0, I0, infinity, 0},
                        KL),
    SL = element(1, T1),
    Tree = leveled_tree:from_orderedlist(lists:ukeysort(1, SL), ?CACHE_TYPE),
    T2 = setelement(1, T1, Tree),
    case pcl_pushmem(PCL, T2) of
        returned ->
            timer:sleep(50),
            maybe_pause_push(PCL, KL);
        ok ->
            ok
    end.

%% old test data doesn't have the magic hash
add_missing_hash({K, {SQN, ST, MD}}) ->
    {K, {SQN, ST, leveled_codec:segment_hash(K), MD}}.

archive_files_test() ->
    RootPath = "test/test_area/ledger",
    SSTPath = sst_rootpath(RootPath),
    ok = filelib:ensure_dir(SSTPath),
    ok = file:write_file(SSTPath ++ "/test1.sst", "hello_world"),
    ok = file:write_file(SSTPath ++ "/test2.sst", "hello_world"),
    ok = file:write_file(SSTPath ++ "/test3.bob", "hello_world"),
    UsedFiles = ["./test1.sst"],
    ok = archive_files(RootPath, UsedFiles),
    {ok, AllFiles} = file:list_dir(SSTPath),
    ?assertMatch(true, lists:member("test1.sst", AllFiles)),
    ?assertMatch(false, lists:member("test2.sst", AllFiles)),
    ?assertMatch(true, lists:member("test3.bob", AllFiles)),
    ?assertMatch(true, lists:member("test2.bak", AllFiles)),
    ok = clean_subdir(SSTPath).

shutdown_when_compact(Pid) ->
    FoldFun =
        fun(_I, Ready) ->
            case Ready of
                true ->
                    true;
                false ->
                    timer:sleep(200),
                    not pcl_checkforwork(Pid)
            end
        end,
    true = lists:foldl(FoldFun, false, lists:seq(1, 100)),
    io:format("No outstanding compaction work for ~w~n", [Pid]),
    pcl_close(Pid).

format_status_test() ->
    RootPath = "test/test_area/ledger",
    clean_testdir(RootPath),
    {ok, PCL} =
        pcl_start(#penciller_options{root_path=RootPath,
                                        max_inmemory_tablesize=1000,
                                        sst_options=#sst_options{}}),
    {status, PCL, {module, gen_server}, SItemL} = sys:get_status(PCL),
    S = lists:keyfind(state, 1, lists:nth(5, SItemL)),
    true = is_integer(array:size(element(2, S#state.manifest))),
    ST = format_status(terminate, [dict:new(), S]),
    ?assertMatch(redacted, ST#state.manifest),
    ?assertMatch(redacted, ST#state.levelzero_cache),
    ?assertMatch(redacted, ST#state.levelzero_index),
    ?assertMatch(redacted, ST#state.levelzero_astree),
    clean_testdir(RootPath).

close_no_crash_test_() ->
    {timeout, 60, fun close_no_crash_tester/0}.

close_no_crash_tester() ->
    RootPath = "test/test_area/ledger_close",
    clean_testdir(RootPath),
    {ok, PCL} =
        pcl_start(
            #penciller_options{
                root_path=RootPath,
                max_inmemory_tablesize=1000,
                sst_options=#sst_options{}}),
    {ok, PclSnap} =
        pcl_snapstart(
            #penciller_options{
                start_snapshot = true,
                snapshot_query = undefined,
                bookies_mem = {empty_cache, empty_index, 1, 1},
                source_penciller = PCL,
                snapshot_longrunning = true,
                bookies_pid = self()
            }
        ),
    exit(PclSnap, kill),
    ok = pcl_close(PCL),
    clean_testdir(RootPath).

simple_server_test() ->
    RootPath = "test/test_area/ledger",
    clean_testdir(RootPath),
    {ok, PCL} =
        pcl_start(#penciller_options{root_path=RootPath,
                                        max_inmemory_tablesize=1000,
                                        sst_options=#sst_options{}}),
    Key1_Pre = {{o,"Bucket0001", "Key0001", null},
                {1, {active, infinity}, null}},
    Key1 = add_missing_hash(Key1_Pre),
    KL1 = generate_randomkeys({1000, 2}),
    Key2_Pre = {{o,"Bucket0002", "Key0002", null},
                {1002, {active, infinity}, null}},
    Key2 = add_missing_hash(Key2_Pre),
    KL2 = generate_randomkeys({900, 1003}),
    % Keep below the max table size by having 900 not 1000
    Key3_Pre = {{o,"Bucket0003", "Key0003", null},
                {2003, {active, infinity}, null}},
    Key3 = add_missing_hash(Key3_Pre),
    KL3 = generate_randomkeys({1000, 2004}),
    Key4_Pre = {{o,"Bucket0004", "Key0004", null},
                {3004, {active, infinity}, null}},
    Key4 = add_missing_hash(Key4_Pre),
    KL4 = generate_randomkeys({1000, 3005}),
    ok = maybe_pause_push(PCL, [Key1]),
    ?assertMatch(Key1, pcl_fetch(PCL, {o,"Bucket0001", "Key0001", null})),
    ok = maybe_pause_push(PCL, KL1),
    ?assertMatch(Key1, pcl_fetch(PCL, {o,"Bucket0001", "Key0001", null})),
    ok = maybe_pause_push(PCL, [Key2]),
    ?assertMatch(Key1, pcl_fetch(PCL, {o,"Bucket0001", "Key0001", null})),
    ?assertMatch(Key2, pcl_fetch(PCL, {o,"Bucket0002", "Key0002", null})),

    ok = maybe_pause_push(PCL, KL2),
    ?assertMatch(Key2, pcl_fetch(PCL, {o,"Bucket0002", "Key0002", null})),
    ok = maybe_pause_push(PCL, [Key3]),

    ?assertMatch(Key1, pcl_fetch(PCL, {o,"Bucket0001", "Key0001", null})),
    ?assertMatch(Key2, pcl_fetch(PCL, {o,"Bucket0002", "Key0002", null})),
    ?assertMatch(Key3, pcl_fetch(PCL, {o,"Bucket0003", "Key0003", null})),

    true = pcl_checkbloomtest(PCL, {o,"Bucket0001", "Key0001", null}),
    true = pcl_checkbloomtest(PCL, {o,"Bucket0002", "Key0002", null}),
    true = pcl_checkbloomtest(PCL, {o,"Bucket0003", "Key0003", null}),
    false = pcl_checkbloomtest(PCL, {o,"Bucket9999", "Key9999", null}),

    ok = shutdown_when_compact(PCL),

    {ok, PCLr} =
        pcl_start(#penciller_options{root_path=RootPath,
                                        max_inmemory_tablesize=1000,
                                        sst_options=#sst_options{}}),
    ?assertMatch(2003, pcl_getstartupsequencenumber(PCLr)),
    % ok = maybe_pause_push(PCLr, [Key2] ++ KL2 ++ [Key3]),
    true = pcl_checkbloomtest(PCLr, {o,"Bucket0001", "Key0001", null}),
    true = pcl_checkbloomtest(PCLr, {o,"Bucket0002", "Key0002", null}),
    true = pcl_checkbloomtest(PCLr, {o,"Bucket0003", "Key0003", null}),
    false = pcl_checkbloomtest(PCLr, {o,"Bucket9999", "Key9999", null}),

    ?assertMatch(Key1, pcl_fetch(PCLr, {o,"Bucket0001", "Key0001", null})),
    ?assertMatch(Key2, pcl_fetch(PCLr, {o,"Bucket0002", "Key0002", null})),
    ?assertMatch(Key3, pcl_fetch(PCLr, {o,"Bucket0003", "Key0003", null})),
    ok = maybe_pause_push(PCLr, KL3),
    ok = maybe_pause_push(PCLr, [Key4]),
    ok = maybe_pause_push(PCLr, KL4),
    ?assertMatch(Key1, pcl_fetch(PCLr, {o,"Bucket0001", "Key0001", null})),
    ?assertMatch(Key2, pcl_fetch(PCLr, {o,"Bucket0002", "Key0002", null})),
    ?assertMatch(Key3, pcl_fetch(PCLr, {o,"Bucket0003", "Key0003", null})),
    ?assertMatch(Key4, pcl_fetch(PCLr, {o,"Bucket0004", "Key0004", null})),
    {ok, PclSnap, null} =
        leveled_bookie:snapshot_store(
            leveled_bookie:empty_ledgercache(),
            PCLr,
            null,
            {no_monitor, 0},
            ledger,
            undefined,
            false),

    ?assertMatch(Key1, pcl_fetch(PclSnap, {o,"Bucket0001", "Key0001", null})),
    ?assertMatch(Key2, pcl_fetch(PclSnap, {o,"Bucket0002", "Key0002", null})),
    ?assertMatch(Key3, pcl_fetch(PclSnap, {o,"Bucket0003", "Key0003", null})),
    ?assertMatch(Key4, pcl_fetch(PclSnap, {o,"Bucket0004", "Key0004", null})),
    ?assertMatch(current, pcl_checksequencenumber(PclSnap,
                                                    {o,
                                                        "Bucket0001",
                                                        "Key0001",
                                                        null},
                                                    1)),
    ?assertMatch(current, pcl_checksequencenumber(PclSnap,
                                                    {o,
                                                        "Bucket0002",
                                                        "Key0002",
                                                        null},
                                                    1002)),
    ?assertMatch(current, pcl_checksequencenumber(PclSnap,
                                                    {o,
                                                        "Bucket0003",
                                                        "Key0003",
                                                        null},
                                                    2003)),
    ?assertMatch(current, pcl_checksequencenumber(PclSnap,
                                                    {o,
                                                        "Bucket0004",
                                                        "Key0004",
                                                        null},
                                                    3004)),

    % Add some more keys and confirm that check sequence number still
    % sees the old version in the previous snapshot, but will see the new
    % version in a new snapshot

    Key1A_Pre = {{o,"Bucket0001", "Key0001", null},
                    {4005, {active, infinity}, null}},
    Key1A = add_missing_hash(Key1A_Pre),
    KL1A = generate_randomkeys({2000, 4006}),
    ok = maybe_pause_push(PCLr, [Key1A]),
    ok = maybe_pause_push(PCLr, KL1A),
    ?assertMatch(current, pcl_checksequencenumber(PclSnap,
                                                    {o,
                                                        "Bucket0001",
                                                        "Key0001",
                                                        null},
                                                    1)),
    ok = pcl_close(PclSnap),
|
2017-01-14 19:41:09 +00:00
|
|
|
|
Develop 3.1 d30update (#386)
* Mas i370 patch d (#383)
* Refactor penciller memory
In high-volume tests on large key-count clusters, so significant variation in the P0031 time has been seen:
TimeBucket PatchA
a.0ms_to_1ms 18554
b.1ms_to_2ms 51778
c.2ms_to_3ms 696
d.3ms_to_5ms 220
e.5ms_to_8ms 59
f.8ms_to_13ms 40
g.13ms_to_21ms 364
h.21ms_to_34ms 277
i.34ms_to_55ms 34
j.55ms_to_89ms 17
k.89ms_to_144ms 21
l.144ms_to_233ms 31
m.233ms_to_377ms 45
n.377ms_to_610ms 52
o.610ms_to_987ms 59
p.987ms_to_1597ms 55
q.1597ms_to_2684ms 54
r.2684ms_to_4281ms 29
s.4281ms_to_6965ms 7
t.6295ms_to_11246ms 1
It is unclear why this varies so much. The time to add to the cache appears to be minimal (but perhaps there is an issue with timing points in the code), whereas the time to add to the index is much more significant and variable. There is also variable time when the memory is rolled (although the actual activity here appears to be minimal.
The refactoring here is two-fold:
- tidy and simplify by keeping LoopState managed within handle_call, and add more helpful dialyzer specs;
- change the update to the index to be a simple extension of a list, rather than any conversion.
This alternative version of the pmem index in unit test is orders of magnitude faster to add - and is the same order of magnitude to check. Anticipation is that it may be more efficient in terms of memory changes.
* Compress SST index
Reduces the size of the leveled_sst index with two changes:
1 - Where there is a common prefix of tuple elements (e.g. Bucket) across the whole leveled_sst file - only the non-common part is indexed, and a function is used to compare.
2 - There is less "indexing" of the index i.e. only 1 in 16 keys are passed into the gb_trees part instead of 1 in 4
* Immediate hibernate
Reasons for delay in hibernate were not clear.
Straight after creation the process will not be in receipt of messages (must wait for the manifest to be updated), so better to hibernate now. This also means the log PC023 provides more accurate information.
* Refactor BIC
This patch avoids the following:
- repeated replacement of the same element in the BIC (via get_kvrange), by checking presence via GET before sing SET
- Stops re-reading of all elements to discover high modified date
Also there appears to have been a bug where a missing HMD for the file is required to add to the cache. However, now the cache may be erased without erasing the HMD. This means that the cache can never be rebuilt
* Use correct size in test results
erts_debug:flat_size/1 returns size in words (i.e. 8 bytes on 64-bit CPU) not bytes
* Don't change summary record
As it is persisted as part of the file write, any change to the summary record cannot be rolled back
* Clerk to prompt L0 write
Simplifies the logic if the clerk request work for the penciller prompts L0 writes as well as Manifest changes.
The advantage now is that if the penciller memory is full, and PUT load stops, the clerk should still be able to prompt persistence. the penciller can therefore make use of dead time this way
* Add push on journal compact
If there has been a backlog, followed by a quiet period - there may be a large ledger cache left unpushed. Journal compaction events are about once per hour, so the performance overhead of a false push should be minimal, with the advantage of clearing any backlog before load starts again.
This is only relevant to riak users with very off/full batch type workloads.
* Extend tests
To more consistently trigger all overload scenarios
* Fix range keys smaller than prefix
Can't make end key an empty binary in this case, as it may be bigger than any keys within the range, but will appear to be smaller.
Unit tests and ct tests added to expose the potential issue
* Tidy-up
- Remove penciller logs which are no longer called
- Get pclerk to only wait MIN_TIMEOUT after doing work, in case there is a backlog
- Remove update_levelzero_cache function as it is unique to handle_call of push_mem, and simple enough to be inline
- Alight testutil slow offer with standard slow offer used
* Tidy-up
Remove pre-otp20 references.
Reinstate the check that the starting pid is still active, this was added to tidy up shutdown.
Resolve failure to run on otp20 due to `-if` sttaement
* Tidy up
Using null rather then {null, Key} is potentially clearer as it is not a concern what they Key is in this case, and removes a comparison step from the leveled_codec:endkey_passed/2 function.
There were issues with coverage in eunit tests as the leveled_pclerk shut down. This prompted a general tidy of leveled_pclerk (remove passing of LoopState into internal functions, and add dialyzer specs.
* Remove R16 relic
* Further testing another issue
The StartKey must always be less than or equal to the prefix when the first N characters are stripped, but this is not true of the EndKey (for the query) which does not have to be between the FirstKey and the LastKey.
If the EndKey query does not match it must be greater than the Prefix (as otherwise it would not have been greater than the FirstKey - so set to null.
* Fix unit test
Unit test had a typo - and result interpretation had a misunderstanding.
* Code and spec tidy
Also look to the cover the situation when the FirstKey is the same as the Prefix with tests.
This is, in theory, not an issue as it is the EndKey for each sublist which is indexed in leveled_tree. However, guard against it mapping to null here, just in case there are dangers lurking (note that tests will still pass without `M > N` guard in place.
* Hibernate on BIC complete
There are three situations when the BIC becomes complete:
- In a file created as part of a merge the BIS is learned in the merge
- After startup, files below L1 learn the block cache through reads that happen to read the block, eventually the while cache will be read, unless...
- Either before/after the cache is complete, it can get whiped by a timeout after a get_sqn request (e.g. as prompted by a journal compaction) ... it will then be re-filled of the back of get/get-range requests.
In all these situations we want to hibernate after the BIC is fill - to reflect the fact that the LoopState should now be relatively stable, so it is a good point to GC and rationalise location of data.
Previously on the the first base was covered. Now all three are covered through the bic_complete message.
* Test all index keys have same term
This works functionally, but is not optimised (the term is replicated in the index)
* Summaries with same index term
If the summary index all have the same index term - only the object keys need to be indexes
* Simplify case statements
We either match the pattern of <<Prefix:N, Suffix>> or the answer should be null
* OK for M == N
If M = N for the first key, it will have a suffix of <<>>. This will match (as expected) a query Start Key of the same size, and be smaller than any query Start Key that has the same prefix.
If the query Start Key does not match the prefix - it will be null - as it must be smaller than the Prefix (as otherwise the query Start Key would be bigger than the Last Key).
The constraint of M > N was introduced before the *_prefix_filter functions were checking the prefix, to avoid issues. Now that the prefix is being checked, M == N is ok.
* Simplify
Correct the test to use a binary field in the range.
To avoid further issues, only apply the filter when everything is a binary() type.
* Add test for head_only mode
When leveled is used as a tictacaae key store (in parallel mode), the keys will be head_only entries. Double check they are handled as expected, like object keys.
* Revert previous change - must support typed buckets
Add assertion to confirm worthwhile optimisation
* Add support for configurable cache multiple (#375)
* Mas i370 patch e (#385)
Improvement to monitoring for efficiency and improved readability of logs and stats.
As part of this, where possible, tried to avoid updating loop state on READ messages in leveled processes (as was the case when tracking stats within each process).
No performance benefits found with change, but improved stats has helped discover other potential gains.
    {ok, PclSnap2, null} =
        leveled_bookie:snapshot_store(
            leveled_bookie:empty_ledgercache(),
            PCLr,
            null,
            {no_monitor, 0},
            ledger,
            undefined,
            false),
    ?assertMatch(replaced,
                 pcl_checksequencenumber(PclSnap2,
                                         {o, "Bucket0001", "Key0001", null},
                                         1)),
    ?assertMatch(current,
                 pcl_checksequencenumber(PclSnap2,
                                         {o, "Bucket0001", "Key0001", null},
                                         4005)),
    ?assertMatch(current,
                 pcl_checksequencenumber(PclSnap2,
                                         {o, "Bucket0002", "Key0002", null},
                                         1002)),
    ok = pcl_close(PclSnap2),
    ok = pcl_close(PCLr),
    clean_testdir(RootPath).

simple_findnextkey_test() ->
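    %% find_nextkey/3 should return the next smallest key within the range
    %% across all levels of the query array, consuming each entry as it is
    %% returned, until no_more_keys.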
    QueryArray =
        [{2, [{{o, "Bucket1", "Key1", null}, {5, {active, infinity}, {0, 0}, null}},
              {{o, "Bucket1", "Key5", null}, {4, {active, infinity}, {0, 0}, null}}]},
         {3, [{{o, "Bucket1", "Key3", null}, {3, {active, infinity}, {0, 0}, null}}]},
         {5, [{{o, "Bucket1", "Key2", null}, {2, {active, infinity}, {0, 0}, null}}]}],
    {Array2, KV1} = find_nextkey(QueryArray,
                                 {o, "Bucket1", "Key0", null},
                                 {o, "Bucket1", "Key5", null}),
    ?assertMatch({{o, "Bucket1", "Key1", null},
                  {5, {active, infinity}, {0, 0}, null}},
                 KV1),
    {Array3, KV2} = find_nextkey(Array2,
                                 {o, "Bucket1", "Key0", null},
                                 {o, "Bucket1", "Key5", null}),
    ?assertMatch({{o, "Bucket1", "Key2", null},
                  {2, {active, infinity}, {0, 0}, null}},
                 KV2),
    {Array4, KV3} = find_nextkey(Array3,
                                 {o, "Bucket1", "Key0", null},
                                 {o, "Bucket1", "Key5", null}),
    ?assertMatch({{o, "Bucket1", "Key3", null},
                  {3, {active, infinity}, {0, 0}, null}},
                 KV3),
    {Array5, KV4} = find_nextkey(Array4,
                                 {o, "Bucket1", "Key0", null},
                                 {o, "Bucket1", "Key5", null}),
    ?assertMatch({{o, "Bucket1", "Key5", null},
                  {4, {active, infinity}, {0, 0}, null}},
                 KV4),
    ER = find_nextkey(Array5,
                      {o, "Bucket1", "Key0", null},
                      {o, "Bucket1", "Key5", null}),
    ?assertMatch(no_more_keys, ER).

sqnoverlap_findnextkey_test() ->
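    %% Key5 appears at two levels (SQN 4 at level 2, SQN 2 at level 5);
    %% the entry with the higher SQN should be returned and the stale
    %% duplicate consumed silently.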
    QueryArray =
        [{2, [{{o, "Bucket1", "Key1", null}, {5, {active, infinity}, {0, 0}, null}},
              {{o, "Bucket1", "Key5", null}, {4, {active, infinity}, {0, 0}, null}}]},
         {3, [{{o, "Bucket1", "Key3", null}, {3, {active, infinity}, {0, 0}, null}}]},
         {5, [{{o, "Bucket1", "Key5", null}, {2, {active, infinity}, {0, 0}, null}}]}],
    {Array2, KV1} = find_nextkey(QueryArray,
                                 {o, "Bucket1", "Key0", null},
                                 {o, "Bucket1", "Key5", null}),
    ?assertMatch({{o, "Bucket1", "Key1", null},
                  {5, {active, infinity}, {0, 0}, null}},
                 KV1),
    {Array3, KV2} = find_nextkey(Array2,
                                 {o, "Bucket1", "Key0", null},
                                 {o, "Bucket1", "Key5", null}),
    ?assertMatch({{o, "Bucket1", "Key3", null},
                  {3, {active, infinity}, {0, 0}, null}},
                 KV2),
    {Array4, KV3} = find_nextkey(Array3,
                                 {o, "Bucket1", "Key0", null},
                                 {o, "Bucket1", "Key5", null}),
    ?assertMatch({{o, "Bucket1", "Key5", null},
                  {4, {active, infinity}, {0, 0}, null}},
                 KV3),
    ER = find_nextkey(Array4,
                      {o, "Bucket1", "Key0", null},
                      {o, "Bucket1", "Key5", null}),
    ?assertMatch(no_more_keys, ER).

sqnoverlap_otherway_findnextkey_test() ->
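    %% As above, but with the higher SQN for Key5 at the lower level -
    %% the level 5 entry (SQN 2) should still win over SQN 1 at level 2.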
    QueryArray =
        [{2, [{{o, "Bucket1", "Key1", null}, {5, {active, infinity}, {0, 0}, null}},
              {{o, "Bucket1", "Key5", null}, {1, {active, infinity}, {0, 0}, null}}]},
         {3, [{{o, "Bucket1", "Key3", null}, {3, {active, infinity}, {0, 0}, null}}]},
         {5, [{{o, "Bucket1", "Key5", null}, {2, {active, infinity}, {0, 0}, null}}]}],
    {Array2, KV1} = find_nextkey(QueryArray,
                                 {o, "Bucket1", "Key0", null},
                                 {o, "Bucket1", "Key5", null}),
    ?assertMatch({{o, "Bucket1", "Key1", null},
                  {5, {active, infinity}, {0, 0}, null}},
                 KV1),
    {Array3, KV2} = find_nextkey(Array2,
                                 {o, "Bucket1", "Key0", null},
                                 {o, "Bucket1", "Key5", null}),
    ?assertMatch({{o, "Bucket1", "Key3", null},
                  {3, {active, infinity}, {0, 0}, null}},
                 KV2),
    {Array4, KV3} = find_nextkey(Array3,
                                 {o, "Bucket1", "Key0", null},
                                 {o, "Bucket1", "Key5", null}),
    ?assertMatch({{o, "Bucket1", "Key5", null},
                  {2, {active, infinity}, {0, 0}, null}},
                 KV3),
    ER = find_nextkey(Array4,
                      {o, "Bucket1", "Key0", null},
                      {o, "Bucket1", "Key5", null}),
    ?assertMatch(no_more_keys, ER).

foldwithimm_simple_test() ->
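    %% keyfolder should merge the in-memory iterator with the query array
    %% over the range Key1..Key6, with higher-SQN in-memory entries (e.g.
    %% Key1 at SQN 8) masking their on-disk counterparts.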
    Now = leveled_util:integer_now(),
    QueryArray =
        [{2, [{{o, "Bucket1", "Key1", null}, {5, {active, infinity}, 0, null}},
              {{o, "Bucket1", "Key5", null}, {1, {active, infinity}, 0, null}}]},
         {3, [{{o, "Bucket1", "Key3", null}, {3, {active, infinity}, 0, null}}]},
         {5, [{{o, "Bucket1", "Key5", null}, {2, {active, infinity}, 0, null}}]}],
    KL1A = [{{o, "Bucket1", "Key6", null}, {7, {active, infinity}, 0, null}},
            {{o, "Bucket1", "Key1", null}, {8, {active, infinity}, 0, null}},
            {{o, "Bucket1", "Key8", null}, {9, {active, infinity}, 0, null}}],
    IMM2 = leveled_tree:from_orderedlist(lists:ukeysort(1, KL1A), ?CACHE_TYPE),
    IMMiter = leveled_tree:match_range({o, "Bucket1", "Key1", null},
                                       {o, null, null, null},
                                       IMM2),
    AccFun = fun(K, V, Acc) -> SQN = leveled_codec:strip_to_seqonly({K, V}),
                               Acc ++ [{K, SQN}] end,
    Acc = keyfolder(IMMiter,
                    QueryArray,
                    {o, "Bucket1", "Key1", null}, {o, "Bucket1", "Key6", null},
                    {AccFun, [], Now}),
    ?assertMatch([{{o, "Bucket1", "Key1", null}, 8},
                  {{o, "Bucket1", "Key3", null}, 3},
                  {{o, "Bucket1", "Key5", null}, 2},
                  {{o, "Bucket1", "Key6", null}, 7}], Acc),

    IMMiterA = [{{o, "Bucket1", "Key1", null},
                 {8, {active, infinity}, 0, null}}],
    AccA = keyfolder(IMMiterA,
                     QueryArray,
                     {o, "Bucket1", "Key1", null},
                     {o, "Bucket1", "Key6", null},
                     {AccFun, [], Now}),
    ?assertMatch([{{o, "Bucket1", "Key1", null}, 8},
                  {{o, "Bucket1", "Key3", null}, 3},
                  {{o, "Bucket1", "Key5", null}, 2}], AccA),

    AddKV = {{o, "Bucket1", "Key4", null}, {10, {active, infinity}, 0, null}},
    KL1B = [AddKV|KL1A],
    IMM3 = leveled_tree:from_orderedlist(lists:ukeysort(1, KL1B), ?CACHE_TYPE),
    IMMiterB = leveled_tree:match_range({o, "Bucket1", "Key1", null},
                                        {o, null, null, null},
                                        IMM3),
    io:format("Compare IMM3 with QueryArray~n"),
    AccB = keyfolder(IMMiterB,
                     QueryArray,
                     {o, "Bucket1", "Key1", null}, {o, "Bucket1", "Key6", null},
                     {AccFun, [], Now}),
    ?assertMatch([{{o, "Bucket1", "Key1", null}, 8},
                  {{o, "Bucket1", "Key3", null}, 3},
                  {{o, "Bucket1", "Key4", null}, 10},
                  {{o, "Bucket1", "Key5", null}, 2},
                  {{o, "Bucket1", "Key6", null}, 7}], AccB).

create_file_test() ->
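    %% A pre-existing file with the target name should be preserved with a
    %% .discarded extension once the new L0 file is persisted and cleared.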
    {RP, Filename} = {"test/test_area/", "new_file.sst"},
    ok = file:write_file(filename:join(RP, Filename), term_to_binary("hello")),
    KVL = lists:usort(generate_randomkeys({50000, 0})),
    Tree = leveled_tree:from_orderedlist(KVL, ?CACHE_TYPE),
    {ok, SP, noreply} =
        leveled_sst:sst_newlevelzero(RP,
                                     Filename,
                                     1,
                                     [Tree],
                                     undefined,
                                     50000,
                                     #sst_options{press_method = native}),
    {ok, SrcFN, StartKey, EndKey} = leveled_sst:sst_checkready(SP),
    io:format("StartKey ~w EndKey ~w~n", [StartKey, EndKey]),
    ?assertMatch({o, _, _, _}, StartKey),
    ?assertMatch({o, _, _, _}, EndKey),
    ?assertMatch("./new_file.sst", SrcFN),
    ok = leveled_sst:sst_clear(SP),
    {ok, Bin} = file:read_file("test/test_area/new_file.sst.discarded"),
    ?assertMatch("hello", binary_to_term(Bin)).

slow_fetch_test() ->
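    %% log_slowfetch should pass the fetch result through unchanged,
    %% whether or not the fetch was slow enough to log.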
    ?assertMatch(not_present, log_slowfetch(2, not_present, "fake", 0, 1)),
    ?assertMatch("value", log_slowfetch(2, "value", "fake", 0, 1)).

coverage_cheat_test() ->
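    %% Exercise otherwise-unreached gen_server callbacks purely for coverage.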
    {noreply, _State0} = handle_info(timeout, #state{}),
    {ok, _State1} = code_change(null, #state{}, null).

handle_down_test() ->
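    %% A snapshot should stay alive while the bookie that requested it is
    %% alive, and should be shut down by the penciller once the bookie's
    %% 'DOWN' message has been received.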
    RootPath = "test/test_area/ledger",
    clean_testdir(RootPath),
    {ok, PCLr} =
        pcl_start(#penciller_options{root_path=RootPath,
                                     max_inmemory_tablesize=1000,
                                     sst_options=#sst_options{}}),
    FakeBookie = spawn(fun loop/0),

    Mon = erlang:monitor(process, FakeBookie),

    FakeBookie ! {snap, PCLr, self()},

    {ok, PclSnap, null} =
        receive
            {FakeBookie, {ok, Snap, null}} ->
                {ok, Snap, null}
        end,

    CheckSnapDiesFun =
        fun(_X, IsDead) ->
            case IsDead of
                true ->
                    true;
                false ->
                    case erlang:process_info(PclSnap) of
                        undefined ->
                            true;
                        _ ->
                            timer:sleep(100),
                            false
                    end
            end
        end,
    ?assertNot(lists:foldl(CheckSnapDiesFun, false, [1, 2])),

    FakeBookie ! stop,

    receive
        {'DOWN', Mon, process, FakeBookie, normal} ->
            %% Now we know that pclr should have received this too!
            %% (better than timer:sleep/1)
            ok
    end,

    ?assert(lists:foldl(CheckSnapDiesFun, false, lists:seq(1, 10))),

    pcl_close(PCLr),
    clean_testdir(RootPath).

%% The fake bookie. Some calls to leveled_bookie (like the two below)
%% do not go via the gen_server (but it looks like they expect to be
%% called by the gen_server, internally!); they use self() to
%% populate the bookie's pid in the pclr. Wrapping the calls in this
%% process ensures that the TEST controls the bookie's Pid - the
%% FakeBookie.
loop() ->
    receive
        {snap, PCLr, TestPid} ->
            {ok, Snap, null} =
                leveled_bookie:snapshot_store(
                    leveled_bookie:empty_ledgercache(),
                    PCLr,
                    null,
                    {no_monitor, 0},
                    ledger,
                    undefined,
                    false),
            TestPid ! {self(), {ok, Snap, null}},
            loop();
        stop ->
            ok
    end.
-endif.