diff --git a/README.md b/README.md index 0a0adda..3d54494 100644 --- a/README.md +++ b/README.md @@ -2,17 +2,19 @@ ## Introduction -Leveled is a work-in-progress prototype of a simple Key-Value store based on the concept of Log-Structured Merge Trees, with the following characteristics: +Leveled is a simple Key-Value store based on the concept of Log-Structured Merge Trees, with the following characteristics: - Optimised for workloads with larger values (e.g. > 4KB). -- Explicitly supports HEAD requests in addition to GET requests. - - Splits the storage of value between keys/metadata and body, +- Explicitly supports HEAD requests in addition to GET requests: + - Splits the storage of the value between keys/metadata and body (assuming some definition of metadata is provided); + - Allows the application to define what constitutes object metadata and what constitutes the body (value-part) of the object - and to assign tags to objects to manage multiple object-types with different extract rules; - Stores keys/metadata in a merge tree and the full object in a journal of [CDB files](https://en.wikipedia.org/wiki/Cdb_(software)) - - allowing for HEAD requests which have lower overheads than GET requests, and - - queries which traverse keys/metadatas to be supported with fewer side effects on the page cache. + - allowing for HEAD requests which have lower overheads than GET requests; and + - queries which traverse keys/metadata to be supported with fewer side effects on the page cache than folds over keys/objects. - Support for tagging of object types and the implementation of alternative store behaviour based on type. + - Allows for changes to the specific information extracted as metadata and returned from HEAD requests; - Potentially usable for objects with special retention or merge properties. - Support for low-cost clones without locking to provide for scanning queries (e.g. secondary indexes).
@@ -24,6 +26,10 @@ The store has been developed with a focus on being a potential backend to a R An optimised version of Riak KV has been produced in parallel which will exploit the availability of HEAD requests (to access object metadata including version vectors), where a full GET is not required. This, along with reduced write amplification when compared to leveldb, is expected to offer significant improvement in the volume and predictability of throughput for workloads with larger (> 4KB) object sizes, as well as reduced tail latency. +There may be more general uses of Leveled, with the following caveats: + - Leveled should be extended to define new tags that specify what metadata is to be extracted for the inserted objects (or to override the behaviour for the ?STD_TAG). Without this, there will be limited scope to take advantage of the relative efficiency of HEAD and FOLD_HEAD requests. + - If objects are small, the [`head_only` mode](docs/STARTUP_OPTIONS.md#head-only) may be used, which will cease separation of object body from header and use the Key/Metadata store as the only long-term persisted store. In this mode all of the object is treated as Metadata, and the behaviour is closer to that of the leveldb LSM-tree, although with higher median latency. + ## More Details For more details on the store: @@ -71,28 +77,6 @@ More information can be found in the [volume testing section](docs/VOLUME.md). As a general rule though, the most interesting thing is the potential to enable [new features](docs/FUTURE.md). The tagging of different object types, with an ability to set different rules for both compaction and metadata creation by tag, is a potential enabler for further change. Further, having a separate key/metadata store which can be scanned without breaking the page cache or working against mitigation for write amplifications, is also potentially an enabler to offer features to both the developer and the operator. 
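As a sketch of the `head_only` caveat above, such a store might be started as follows. This is illustrative only: the option name follows docs/STARTUP_OPTIONS.md, but the value (`with_lookup`) and the use of `book_close/1` are assumptions about the API rather than confirmed by this document.

```erlang
%% Hedged sketch: a head_only store, where the whole (small) object is
%% treated as metadata and only the Key/Metadata (Ledger) store persists it.
StartOpts = [{root_path, "/tmp/leveled_headonly"},  % hypothetical path
             {head_only, with_lookup}],             % assumed option value
{ok, Bookie} = leveled_bookie:book_start(StartOpts),
%% In this mode reads are served by HEAD requests alone - there is no
%% separate object body held in the Journal to GET.
ok = leveled_bookie:book_close(Bookie).
```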
-## Next Steps - -Further volume test scenarios are the immediate priority, in particular volume test scenarios with: - -- Significant use of secondary indexes; - -- Use of newly available [EC2 hardware](https://aws.amazon.com/about-aws/whats-new/2017/02/now-available-amazon-ec2-i3-instances-next-generation-storage-optimized-high-i-o-instances/) which potentially is a significant changes to assumptions about hardware efficiency and cost. - -- Create riak_test tests for new Riak features enabled by leveled. - -However a number of other changes are planned in the next month to (my branch of) riak_kv to better use leveled: - -- Support for rapid rebuild of hashtrees - -- Fixes to [priority issues](https://github.com/martinsumner/leveled/issues) - -- Experiments with flexible sync on write settings - -- A cleaner and easier build of Riak with leveled included, including cuttlefish configuration support - -More information can be found in the [future section](docs/FUTURE.md). - ## Feedback Please create an issue if you have any suggestions. You can ping me @masleeds if you wish @@ -104,28 +88,14 @@ Unit and current tests in leveled should run with rebar3. Leveled has been test A new database can be started by running ``` -{ok, Bookie} = leveled_bookie:book_start(RootPath, LedgerCacheSize, JournalSize, SyncStrategy) +{ok, Bookie} = leveled_bookie:book_start(StartupOptions) ``` -This will start a new Bookie. It will start and look for existing data files, under the RootPath, and start empty if none exist. A LedgerCacheSize of `2000`, a JournalSize of `500000000` (500MB) and a SyncStrategy of `none` should work OK. Further information on startup options can be found [here](docs/STARTUP_OPTIONS.md). +This will start a new Bookie. It will look for existing data files under the RootPath, and start empty if none exist. Further information on startup options can be found [here](docs/STARTUP_OPTIONS.md).
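A minimal usage sketch for a started Bookie follows. The root path is hypothetical, and the exact arities (`book_put/5`, `book_get/3`, `book_head/3`) are assumptions based on the leveled_bookie API rather than guaranteed by this README:

```erlang
%% Hedged sketch: start a store, PUT an object, then read it back.
StartOpts = [{root_path, "/tmp/leveled_data"}],  % hypothetical minimal options
{ok, Bookie} = leveled_bookie:book_start(StartOpts),
ok = leveled_bookie:book_put(Bookie, <<"Bucket">>, <<"Key">>, <<"Value">>, []),
{ok, <<"Value">>} = leveled_bookie:book_get(Bookie, <<"Bucket">>, <<"Key">>),
%% A HEAD request returns only the Ledger-held metadata, avoiding a
%% read from the Journal.
{ok, _Head} = leveled_bookie:book_head(Bookie, <<"Bucket">>, <<"Key">>),
ok = leveled_bookie:book_close(Bookie).
```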
The book_start method should respond once startup is complete. The [leveled_bookie module](src/leveled_bookie.erl) includes the full API for external use of the store. -It should run anywhere that OTP will run - it has been tested on Ubuntu 14, MAC OS X and Windows 10. - -Running in Riak requires one of the branches of riak_kv referenced [here](docs/FUTURE.md). There is a [Riak branch](https://github.com/martinsumner/riak/tree/mas-leveleddb) intended to support the automatic build of this, and the configuration via cuttlefish. However, the auto-build fails due to other dependencies (e.g. riak_search) bringing in an alternative version of riak_kv, and the configuration via cuttlefish is broken for reasons unknown. - -Building this from source as part of Riak will require a bit of fiddling around. - -- clone and build [riak](https://github.com/martinsumner/riak/tree/mas-leveleddb) -- cd deps -- rm -rf riak_kv -- git clone -b mas-leveled-putfsm --single-branch https://github.com/martinsumner/riak_kv.git -- cd .. -- make rel -- remember to set the storage backend to leveled in riak.conf - -To help with the breakdown of cuttlefish, leveled parameters can be set via riak_kv/include/riak_kv_leveled.hrl - although a new make will be required for these changes to take effect. +Running in Riak requires Riak 2.9 or later, available from January 2019. ### Contributing @@ -136,5 +106,4 @@ ct with 100% coverage. To have rebar3 execute the full set of tests, run: - rebar3 as test do cover --reset, eunit --cover, ct --cover, cover --verbose - + `rebar3 as test do cover --reset, eunit --cover, ct --cover, cover --verbose` diff --git a/docs/DESIGN.md b/docs/DESIGN.md index 4440dd6..8e8c142 100644 --- a/docs/DESIGN.md +++ b/docs/DESIGN.md @@ -82,6 +82,8 @@ Three types are initially supported: All Ledger Keys created for any type must be 4-tuples starting with the tag.
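To illustrate the 4-tuple rule above, Ledger Keys might take shapes like the following. These exact shapes (particularly the `null` sub-key and the index-entry form) are assumptions for illustration, not definitions from this document:

```erlang
%% Hedged sketch of Ledger Key shapes - always 4-tuples led by the tag.
{o,     <<"Bucket">>, <<"Key">>, null}.      %% ?STD_TAG object (assumed null sub-key)
{o_rkv, <<"Bucket">>, <<"Key">>, null}.      %% ?RIAK_TAG object
%% An index entry might carry the {Field, Term} pair in the third position:
{i, <<"Bucket">>, {<<"idx1_bin">>, <<"term">>}, <<"Key">>}.  %% assumed shape
```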
Abstraction with regards to types is currently imperfect, but the expectation is that these types will make support for application specific behaviours easier to achieve, such as behaviours which may be required to support different [CRDTs](https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type). +Currently user-defined tags are supported as an experimental feature along with the ability to override the function which controls how metadata is split from the object value. Good choice of metadata is important to ensure that the improved efficiency of folds over heads (as opposed to folds over objects), and of HEAD requests (as opposed to GET requests), can be exploited by applications using leveled. + ## GET/PUT Paths The PUT path for new objects and object changes depends on the Bookie interacting with the Inker to ensure that the change has been persisted with the Journal; the Ledger is updated in batches after the PUT has been completed. @@ -128,7 +130,7 @@ Backups are taken of the Journal only, as the Ledger can be recreated on startu The backup uses hard-links, so at the point the backup is taken, there will be a minimal change to the on-disk footprint of the store. However, as journal compaction is run, the hard-links will prevent space from getting released by the dropping of replaced journal files - so backups will cause the size of the store to grow faster than it would otherwise do. It is an operator responsibility to garbage collect old backups, to prevent this growth from being an issue. -As backups depend on hard-links, they cannot be taken with a `BackupPath` on a different file system to the standard data path. The move a backup across to a different file system, standard tools should be used such as rsync. The leveled backups should be relatively friendly for rsync-like delta-based backup approaches due to significantly lower write amplification when compared to other LSM stores (e.g. leveldb).
+As backups depend on hard-links, they cannot be taken with a `BackupPath` on a different file system to the standard data path. To move a backup across to a different file system, standard tools such as rsync should be used. The leveled backups should be relatively friendly for rsync-like delta-based backup approaches due to significantly lower write amplification when compared to other LSM stores (e.g. leveldb). ## Head only diff --git a/docs/STARTUP_OPTIONS.md b/docs/STARTUP_OPTIONS.md index a8d7998..b946385 100644 --- a/docs/STARTUP_OPTIONS.md +++ b/docs/STARTUP_OPTIONS.md @@ -22,6 +22,18 @@ There is no stats facility within leveled, the stats are only available from the The `forced_logs` option will force a particular log reference to be logged regardless of the log level that has been set. This can be used to log at a higher level than `info`, whilst allowing for specific logs to still be logged out, such as logs providing sample performance statistics. +## User-Defined Tags + +There are 2 primary object tags - ?STD_TAG (o), which is the default, and ?RIAK_TAG (o_rkv). Objects PUT into the store with different tags may have different behaviours in leveled. + +The differences between tags are encapsulated within the `leveled_head` module. The primary difference of interest is the alternative handling within the function `extract_metadata/3`. Significant efficiency can be gained in leveled (as opposed to other LSM-stores) through using book_head requests when book_get would otherwise be necessary. If 80% of the requests are interested in less than 20% of the information within an object, then having that 20% in the object metadata and switching fetch requests to the book_head API will improve efficiency. Also folds over heads are much more efficient than folds over objects, so significant improvements can also be made within folds by having the right information within the metadata.
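The switch from book_get to book_head described above might look as follows in application code (a sketch; variable names are illustrative and the arity of `book_head` is assumed from the leveled_bookie API):

```erlang
%% Where only the metadata (e.g. a version vector) is needed, a HEAD
%% request avoids reading the object body from the Journal:
{ok, Metadata} = leveled_bookie:book_head(Bookie, Bucket, Key).
%% ...rather than fetching the full object:
%% {ok, Object} = leveled_bookie:book_get(Bookie, Bucket, Key).
```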
+To make use of this efficiency, metadata needs to be extracted on PUT, and made into leveled object metadata. For the ?RIAK_TAG this work is done within the `leveled_head` module. If an application wants to control this behaviour, then a tag can be created and the `leveled_head` module updated. However, it is also possible to have more dynamic definitions for handling of application-defined tags, by passing in alternative versions of one or more of the functions `extract_metadata/3`, `build_head/1` and `key_to_canonicalbinary/1` on start-up. These functions will be applied to user-defined tags (but will not override the behaviour for pre-defined tags). + +The startup option `override_functions` can be used to manage this override. [This test](../test/end_to_end/appdefined_SUITE.erl) provides a simple example of using override_functions. + +This option is currently experimental. Issues such as versioning, and handling a failure to consistently start a store with the same override_functions, should be handled by the application. + ## Max Journal Size The maximum size of an individual Journal file can be set using `{max_journalsize, integer()}`, which sets the size in bytes. The default value is 1,000,000,000 (~1GB). The maximum size, which cannot be exceeded, is `2^32`. It is not expected that the Journal Size should normally be set lower than 100 MB; it should be sized to hold at least many thousands of objects. @@ -61,13 +73,13 @@ The purpose of the reload strategy is to define the behaviour at compaction of t By default nothing is compacted from the Journal if the SQN of the Journal entry is greater than the largest sequence number which has been persisted in the Ledger. So when an object is compacted in the Journal (as it has been replaced), it should not need to be replayed from the Journal into the Ledger in the future - as it, and all its related key changes, have already been persisted to the Ledger.
-However, what if the Ledger had been erased? This could happen due to some corruption, or perhaps because only the Journal is to be backed up. As the object has been replaced, the value is not required - however KeyChanges ay be required (such as indexes which are built incrementally across a series of object changes). So to revert the indexes to their previous state the Key Changes would need to be retained in this case, so the indexes in the Ledger would be correctly rebuilt. +However, what if the Ledger had been erased? This could happen due to some corruption, or perhaps because only the Journal is to be backed up. As the object has been replaced, the value is not required - however KeyChanges may be required (such as indexes which are built incrementally across a series of object changes). So to revert the indexes to their previous state the Key Changes would need to be retained in this case, so the indexes in the Ledger would be correctly rebuilt. There are three potential strategies: -`skip` - don't worry about this scenario, require the Ledger to be backed up; -`retain` - discard the object itself on compaction but keep the key changes; -`recalc` - recalculate the indexes on reload by comparing the information on the object with the current state of the Ledger (as would be required by the PUT process when comparing IndexSpecs at PUT time). + - `skip` - don't worry about this scenario, require the Ledger to be backed up; + - `retain` - discard the object itself on compaction but keep the key changes; + - `recalc` - recalculate the indexes on reload by comparing the information on the object with the current state of the Ledger (as would be required by the PUT process when comparing IndexSpecs at PUT time). There is no code for `recalc` at present; it is simply a logical possibility. So to set a reload strategy there should be an entry like `{reload_strategy, [{TagName, skip|retain}]}`. By default tags are pre-set to `retain`.
If there is no need to handle a corrupted Ledger, then all tags could be set to `skip`.
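For example, a store retaining key changes on compaction for a user-defined tag might be started with an entry like the following (a sketch; the tag name `my_tag` and the root path are hypothetical):

```erlang
%% Hedged sketch: keep key changes on Journal compaction for 'my_tag',
%% so Ledger indexes can still be rebuilt if the Ledger is erased.
StartOpts = [{root_path, "/tmp/leveled_data"},        % hypothetical path
             {reload_strategy, [{my_tag, retain}]}],
{ok, Bookie} = leveled_bookie:book_start(StartOpts).
```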