
Starting Leveled
There are a number of options that can be passed in when starting Leveled; this is an explainer of these options and what they do. The options are passed as a list of `{option_name, Option}` tuples on startup.
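For example, a store might be started and closed as follows (a sketch assuming `leveled_bookie:book_start/1` accepts such a proplist; the root path and option values are illustrative):

```erlang
%% Sketch: start a store with a small set of explicit options, then close it.
StartOpts = [{root_path, "/tmp/leveled_data"},   % illustrative path
             {cache_size, 2500},
             {sync_strategy, none}],
{ok, Bookie} = leveled_bookie:book_start(StartOpts),
ok = leveled_bookie:book_close(Bookie).
```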
Head Only
Starting with `{head_only, true}` (defaults to `false`) will start Leveled in a special mode. In this mode Leveled works a lot more like LevelDB, in that the Journal is just a buffer of recent writes to be used to recover objects on startup. The actual object value is now stored in the LSM tree itself rather than in the Journal.
Objects need to be put into the Leveled store using `book_mput/2` or `book_mput/3` when running in `head_only` mode.
This mode was specifically added to support Leveled's use as a dedicated aae_store in the kv_index_tictactree library. It may be extensible for other uses where objects are small.
There is no current support for running Leveled so that it supports both `head_only` objects stored entirely in the Ledger, alongside other objects stored as normal (split between the Journal and the Ledger). Setting `head_only` fundamentally changes the way the store works.
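A minimal sketch of this mode is shown below, assuming the object specs passed to `book_mput/2` take an `{add, Bucket, Key, SubKey, Value}` form (this shape is an assumption; check the `leveled_bookie:book_mput/2` spec before relying on it):

```erlang
%% Sketch: start in head_only mode and push a small batch of entries.
%% The object spec shape and all values here are illustrative assumptions.
{ok, Bookie} = leveled_bookie:book_start([{root_path, "/tmp/leveled_aae"},
                                          {head_only, true}]),
ObjectSpecs = [{add, <<"bucket">>, <<"key1">>, null, <<"metadata1">>},
               {add, <<"bucket">>, <<"key2">>, null, <<"metadata2">>}],
ok = leveled_bookie:book_mput(Bookie, ObjectSpecs).
```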
Log Level
The log level can be set to `debug`, `info`, `warn`, `error` or `critical`. The `info` log level will generate a significant amount of logs, and in testing this volume of logs has not currently been shown to be detrimental to performance. The log level has been set to be 'noisy' in this way to suit environments which make use of log indexers that can consume large volumes of logs, and allow operators freedom to build queries and dashboards from those indexes.
There is no stats facility within Leveled; the stats are only available from the logs. In the future, a stats facility may be added to provide access to this information without having to run at the `info` log level. [Forced Logs](#forced-logs) may be used to add stats or other info logs selectively.
Forced logs
The `forced_logs` option will force a particular log reference to be logged regardless of the log level that has been set. This can be used to run at a log level higher than `info`, whilst still allowing specific logs to be logged out, such as logs providing sample performance statistics.
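As a sketch, a store could be run quietly overall while forcing out a handful of chosen log references (the log reference atoms below are placeholders, not a recommended set):

```erlang
%% Sketch: raise the overall log level but force selected log references.
%% The b0015/b0016 references are placeholders - substitute references of
%% interest from the leveled log definitions.
StartOpts = [{log_level, error},
             {forced_logs, [b0015, b0016]}].
```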
User-Defined Tags
There are two primary object tags: `?STD_TAG` (`o`), which is the default, and `?RIAK_TAG` (`o_rkv`). Objects PUT into the store with different tags may have different behaviours in Leveled.
The differences between tags are encapsulated within the `leveled_head` module. The primary difference of interest is the alternative handling within the function `extract_metadata/3`. Significant efficiency can be gained in Leveled (as opposed to other LSM-stores) through using book_head requests when book_get would otherwise be necessary. If 80% of the requests are interested in less than 20% of the information within an object, then having that 20% in the object metadata and switching fetch requests to the book_head API will improve efficiency. Also, folds over heads are much more efficient than folds over objects, so significant improvements can also be made within folds by having the right information within the metadata.
To make use of this efficiency, metadata needs to be extracted on PUT and made into Leveled object metadata. For the ?RIAK_TAG this work is within the `leveled_head` module. If an application wants to control this behaviour, then a tag can be created and the `leveled_head` module updated. However, it is also possible to have more dynamic definitions for handling of application-defined tags, by passing in alternative versions of one or more of the functions `extract_metadata/3`, `build_head/1` and `key_to_canonicalbinary/1` on start-up. These functions will be applied to user-defined tags (but will not override the behaviour for pre-defined tags).
The startup option `override_functions` can be used to manage this override. There is a test in the Leveled codebase which provides a simple example of using `override_functions`.
This option is currently experimental. Issues such as versioning, and handling a failure to consistently start a store with the same `override_functions`, should be handled by the application.
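A sketch of such an override is shown below, assuming `override_functions` takes a list of `{FunctionName, Fun}` pairs (an assumption based on the description above); the function body is a simplistic placeholder:

```erlang
%% Sketch: override key_to_canonicalbinary/1 for user-defined tags.
%% The placeholder implementation simply calls term_to_binary/1 on the key.
KeyToCanonical = fun(Key) -> term_to_binary(Key) end,
StartOpts = [{root_path, "/tmp/leveled_data"},
             {override_functions,
                [{key_to_canonicalbinary, KeyToCanonical}]}],
{ok, Bookie} = leveled_bookie:book_start(StartOpts).
```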
Max Journal Size
The maximum size of an individual Journal file can be set using `{max_journalsize, integer()}`, which sets the size in bytes. The default value is 1,000,000,000 (~1GB). The maximum size, which cannot be exceeded, is 2^32 bytes. It is not expected that the Journal size should normally be set lower than 100MB; it should be sized to hold many thousands of objects at least.
If there are smaller objects, then lookups within a Journal may get faster if each individual journal file is smaller. Generally there should be o(100K) objects per journal, to control the maximum size of the hash table within each file. Signs that the journal size is too high may include:
- excessive CPU use and related performance impacts during rolling of CDB files, see log `CDB07`;
- excessive load caused during journal compaction despite tuning down `max_run_length`.
If the store is used to hold bigger objects, the `max_journalsize` may be scaled up accordingly. Having fewer Journal files (by using a larger `max_journalsize`) will reduce the lookup time to find the right Journal during GET requests, but in most circumstances the impact of this improvement is marginal. The primary impact of fewer Journal files is that the decision-making time of Journal compaction (the time to calculate if a compaction should be undertaken, then what should be compacted) will increase. The timing for making compaction calculations can be monitored through log IC003.
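As a rough sizing sketch, for objects averaging around 8KB, keeping to the order of 100K objects per journal file suggests a size of roughly 800MB (8,192 bytes x 100,000 ~ 820MB), comfortably under the 2^32 byte cap (value illustrative):

```erlang
%% Sketch: scale the journal size up for larger objects
%% (~8KB average object size, ~100K objects per file).
StartOpts = [{max_journalsize, 800000000}].
```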
Ledger Cache Size
The option `{cache_size, integer()}` is the number of ledger objects that should be cached by the Bookie actor before being pushed to the Ledger. Note these are ledger objects (so they do not normally contain the actual object value, but do include index changes as separate objects). The default value is 2500.
Penciller Cache Size
The option `{max_pencillercachesize, integer()}` sets the approximate number of objects that should be kept in the penciller memory before it flushes that memory to disk. Note, when this limit is reached, the persist may be delayed by some random jitter to prevent coordination between multiple stores in the same cluster.
The default number of objects is 28,000. A smaller number may be required if there is a particular shortage of memory. Note that this is just Ledger objects (so the actual values are not stored in memory as part of this cache).
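For instance, both caches might be trimmed on a memory-constrained node (illustrative values; the defaults are 2500 and 28,000 respectively):

```erlang
%% Sketch: reduce both the Bookie and Penciller caches to limit memory use.
StartOpts = [{cache_size, 1000},
             {max_pencillercachesize, 12000}].
```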
File Write Sync Strategy
The sync strategy can be set as `{sync_strategy, sync|riak_sync|none}`. This controls whether each write requires that write to be flushed to disk before the write is acknowledged. If `none` is set, flushing to disk is left in the hands of the operating system. `riak_sync` is a deprecated option (it is related to the lack of a sync flag in OTP 16, and will prompt the flush after the write, rather than as part of the write operation).
The default is `sync`. Note that without solid state drives and/or Flash-Backed Write Caches, this option will have a significant impact on performance.
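As a sketch, durability-per-write might be traded for throughput on hardware without flash-backed write caches:

```erlang
%% Sketch: leave flushing to the operating system rather than
%% syncing on every write (accepting the durability trade-off).
StartOpts = [{sync_strategy, none}].
```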
Waste Retention Period
The waste retention period can be used to keep old journal files that have already been compacted for that period. This might be useful if there is a desire to back up a machine so that it is restorable to a particular point in time (by clearing the ledger, and reverting the inker manifest).
The retention period can be set using `{waste_retention_period, integer()}` where the value is the period in seconds. If left as `undefined`, all files will be garbage collected on compaction, and no waste will be retained.
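For example, a one-day retention could be configured as below (24 x 60 x 60 = 86,400 seconds; value illustrative):

```erlang
%% Sketch: keep compacted journal waste for 24 hours.
StartOpts = [{waste_retention_period, 86400}].
```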
Reload Strategy
The purpose of the reload strategy is to define the behaviour at compaction of the Journal on finding a replaced record, in order to manage the behaviour when reloading the Ledger from the Journal.
By default nothing is compacted from the Journal if the SQN of the Journal entry is greater than the largest sequence number which has been persisted in the Ledger. So when an object is compacted in the Journal (as it has been replaced), it should not need to be replayed from the Journal into the Ledger in the future - as it, and all its related key changes, have already been persisted to the Ledger.
However, what if the Ledger had been erased? This could happen due to some corruption, or perhaps because only the Journal is to be backed up. As the object has been replaced, the value is not required - however the Key Changes may be required (such as indexes which are built incrementally across a series of object changes). So, to revert the indexes to their previous state, the Key Changes would need to be retained in this case, so that the indexes in the Ledger can be correctly rebuilt.
There are three potential strategies:
- `recovr` - don't worry about this scenario, require the Ledger to be backed up;
- `retain` - discard the object itself on compaction but keep the key changes;
- `recalc` - recalculate the indexes on reload by comparing the information on the object with the current state of the Ledger (as would be required by the PUT process when comparing IndexSpecs at PUT time).
To set a reload strategy requires a list of tuples to match tag names to strategy: `{reload_strategy, [{TagName, recovr|retain|recalc}]}`. By default tags are pre-set to `retain`. If there is no need to handle a corrupted Ledger, then all tags could be set to `recovr` - this assumes that either the ledger files are protected by some other means from corruption, or an external anti-entropy mechanism will recover the lost data.
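A sketch, using the tag atoms named above (the choice of strategies here is illustrative only):

```erlang
%% Sketch: keep key changes for the Riak tag so indexes can be rebuilt
%% from the Journal, and accept potential Ledger loss for the standard tag.
StartOpts = [{reload_strategy, [{o_rkv, retain}, {o, recovr}]}].
```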
Compression Method
Compression method can be set to `native` or `lz4` (i.e. `{compression_method, native|lz4}`). Native compression will use the compress option in Erlang's native `term_to_binary/2` function, whereas lz4 compression will use a NIF'd LZ4 library.
This is the compression used both when writing an object to the Journal, and a block of keys to the Ledger. There is a throughput advantage of around 2-5% associated with using `lz4` compression.
Compression Point
Compression point can be set using `{compression_point, on_receipt|on_compact}`. This refers only to compression in the Journal; key blocks are always compressed in the Ledger. The option is whether to accept additional PUT latency by compressing as objects are received, or to defer the compressing of objects in the Journal until they are re-written as part of a compaction (which may never happen).
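As a sketch, the two compression options might be combined to favour PUT latency (an illustrative pairing, not a recommendation):

```erlang
%% Sketch: use lz4 compression, and only compress Journal objects
%% when (and if) they are re-written during compaction.
StartOpts = [{compression_method, lz4},
             {compression_point, on_compact}].
```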
Root Path
The root path is the name of the folder in which the database has been (or should be) persisted.
Journal Compaction
The compaction of the Journal is the process through which the space of replaced (or deleted) objects can be reclaimed from the Journal. This is controlled through the following parameters:
The `compaction_runs_perday` indicates for the Leveled store how many times each day it will attempt to run a compaction (it is normal for this to be approximately equal to the number of hours per day that compaction is permitted).
The `compaction_low_hour` and `compaction_high_hour` are the hours of the day which define the compaction window - set to 0 and 23 respectively if compaction is required to be a continuous process.
The `max_run_length` controls how many files can be compacted in a single compaction run. The scoring of files and runs is controlled through `maxrunlength_compactionpercentage` and `singlefile_compactionpercentage`. The `singlefile_compactionpercentage` is an acceptable compaction score for a file to be eligible for compaction on its own, whereas the `maxrunlength_compactionpercentage` is the score required for a run of the `max_run_length` to be considered eligible. The higher the `maxrunlength_compactionpercentage` and the lower the `singlefile_compactionpercentage`, the more likely a longer run will be chosen over a shorter run.
The `journalcompaction_scoreonein` option controls how frequently a file will be scored. If this is set to one, then each and every file will be scored on each and every compaction run. If this is set to an integer greater than one ('n'), then on average any given file will only be scored on one in 'n' runs; on other runs, a cached score for the file will be used. On startup all files will be scored on the first run. As journals get very large, and where frequent compaction is required due to mutating objects, this can save significant resource. In Riak, this option is controlled via `leveled.compaction_scores_perday`, with the number of `leveled.compaction_runs_perday` being divided by this to produce the `journalcompaction_scoreonein`. By default each file will only be scored once per day.
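A combined sketch of these compaction parameters (all values illustrative, chosen only to show how the options fit together - six runs per day within a six-hour overnight window, with each file scored roughly once per day):

```erlang
%% Sketch: compact only between midnight and 06:00, up to six runs per
%% day, favouring longer runs, and score each file about once per day.
StartOpts = [{compaction_runs_perday, 6},
             {compaction_low_hour, 0},
             {compaction_high_hour, 6},
             {max_run_length, 4},
             {maxrunlength_compactionpercentage, 75.0},
             {singlefile_compactionpercentage, 25.0},
             {journalcompaction_scoreonein, 6}].
```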
Snapshot Timeouts
There are two snapshot timeouts that can be configured:
- `snapshot_timeout_short`
- `snapshot_timeout_long`
These set the period in seconds before a snapshot which has not shut down is declared to have been released - so that any file deletions which are awaiting the snapshot's completion can go ahead.
This covers only silently failing snapshots. Snapshots that shut down neatly will be released from locking deleted files when they shut down. The 'short' timeout is used for snapshots which support index queries and bucket listing. The 'long' timeout is used for all other folds (e.g. key lists, head folds and object folds).
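For example (illustrative values only):

```erlang
%% Sketch: release silently-failed query snapshots after 30 minutes and
%% long-running fold snapshots after 12 hours (both values in seconds).
StartOpts = [{snapshot_timeout_short, 1800},
             {snapshot_timeout_long, 43200}].
```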
Statistic gathering
Leveled will gather monitoring statistics on HEAD/GET/PUT requests, with timing points taken throughout the store. These timings are gathered by the `leveled_monitor`, and there are three configuration options. The two primary options are: `stats_percentage`, an integer between 0 and 100 which informs the store of the proportion of requests which should be timed at each part; and `stats_logfrequency`, which controls the frequency (in seconds) with which the leveled_monitor will write a log (for one of the stats types in its queue).
The specific stats types logged can be found in the ?LOG_LIST within the leveled_monitor. If only a subset is of interest, then this list can be modified by setting `monitor_loglist`. This can also be used to increase the frequency of individual log types by adding them to the list multiple times.
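A sketch of the two primary monitor options (values illustrative):

```erlang
%% Sketch: time roughly 1 in 10 requests, and write one stats log
%% every 30 seconds.
StartOpts = [{stats_percentage, 10},
             {stats_logfrequency, 30}].
```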