
Generational compaction #5583


Draft
jcoglan wants to merge 10 commits into main from feat/generational-compaction

Conversation

jcoglan
Contributor

@jcoglan jcoglan commented Jul 2, 2025

(This is a draft we're opening for discussion. The bulk of required information on design background, analysis and implementation is in the commits, including some design docs added to the repo. We will flesh this PR out as the feature gets closer to being ready.)

Overview

This PR implements a "generational" storage model in couch_bt_engine, which @janl and I have been working on. Its aim is to improve the performance of compaction on large databases with seldom-changing documents, where every compaction run currently has to copy a mostly-unchanged set of data into the new file.

The generational model splits a shard's data storage into multiple generations, where the usual db.couch file is "generation 0". On compaction, live data in this file is promoted into generation 1. The next time generation 0 is compacted, it does not have to copy the same set of data again, as much of it will have been moved to another file.

Further detail on the design and analysis is in design docs we have committed to the repo; see https://github.com/neighbourhoodie/couchdb/blob/feat/generational-compaction/src/couch/doc/generational-compaction. The commit messages give further details about the implementation.

Open questions

  • What tests need to be added to adequately cover this functionality and make sure there is no risk of data loss?
  • Since compaction is now parameterised by a generation, how do we ensure that only one compaction runs per shard at a time? That is, generations 1 and 2 of the same shard must never compact concurrently, since this would break our consistency assumptions.

Testing recommendations

Related Issues or Pull Requests

Checklist

  • Code is written and works correctly
  • Changes are covered by tests
  • Any new configurable parameters are documented in rel/overlay/etc/default.ini
  • Documentation changes were made in the src/docs folder
  • Documentation changes were backported (separated PR) to affected branches

@jcoglan jcoglan force-pushed the feat/generational-compaction branch 3 times, most recently from f0666ca to 29f855a Compare July 8, 2025 14:12
jcoglan added 10 commits July 23, 2025 15:36
To support a generational storage model, the #st struct needs to have
multiple file handles open. Whereas we currently back a shard with a
single file, `db.suffix.couch`, the generational model will augment this
with a set of "generation" files named `db.1.suffix.couch`,
`db.2.suffix.couch`, etc. The original `db.suffix.couch` file is
henceforth referred to as "gen-0".

Each of these file handles needs to be monitored by the incref/decref
functions and so we replace the `fd` and `fd_monitor` fields with a pair
of `{fd, monitor}` stored in the `fd` field. The new `gen_fds` field
stores a list of such pairs, and points at the `db.{1,2,...}.couch`
files.

The number of generational files opened is determined by a new field in
the DB header named `max_generation`. This defaults to 0 so that all
existing databases stay on the current storage model, and need to opt in
to using generational storage.

Here we also add a set of functions that the engine and compactor will
need for managing generational files:

- `generation_file_path()`: returns the path to the Nth generation file;
  returns the normal `db.suffix.couch` path for gen-0.

- `open_generation_file()`: opens and monitors the Nth generation file.

- `open_generation_files()`: opens and monitors all the files for
  generations from 1 to N.

- `maybe_open_generation_files()`: opens and monitors all the generation
  files except if the `compacting` option is set; the compactor does not
  need to re-open the generation files as it will share the existing
  handles with the engine (i.e. we don't open multiple handles to the
  same file).

- `open_additional_generation_file()`: when compacting the highest
  generation, we will open an extra temporary file for its live data to
  be moved into; if `max_generation` = M then this causes `gen_fds` to
  contain M+1 file handles.

- `reopen_generation_file()`: once the file `db.N.couch` has been
  compacted into `db.N+1.couch`, this function will remove and reopen
  the existing `db.N.couch` file so that it becomes empty.

- `delete_generational_files()`: when deleting the database, this
  removes all the generational files.

- `get_fd()`: returns the file handle for the Nth generation, or the
  original gen-0 `db.suffix.couch` file.
In the generational storage model, all new docs/revs continue to be
written to "gen-0", the `db.suffix.couch` file. On compaction, live data
is "promoted" to the next generation; data in `db.couch` is moved to
`db.1.couch`, data in `db.1.couch` to `db.2.couch`, etc. Therefore, doc
body and attachment pointers need to include a representation of which
file they reside in.

This is accomplished by storing a pair of `{Gen, Ptr}` instead of just
`Ptr` when a body/attachment is written to generation 1 or above. When
writing to gen-0, we continue to just store the pointer, rather than
wrapping it in `{0, Ptr}`. This means that we continue to write
backwards-compatible data for databases that have not opted in to
generational storage, and it makes sure we can continue to read existing
data, as pointers stored in gen-0 look the same as they always have.
This commit implements the generational compaction scheme wherein live
data is "promoted" to a higher generation by the compactor. Each
compaction run targets a specific generation N, from 0 up to the
database's maximum generation M. If a database has gen-0 file
`db.couch`, then the compactor works as follows:

- The compactor still creates `db.couch.compact.data` and
  `db.couch.compact.meta` files. If N = M then it also opens the file
  `db.M.couch.compact.maxgen`, and this file is added to the end of
  `gen_fds`, creating a temporary generation M+1 file.

- The compactor shares the `gen_fds` file handles with the main DB
  engine, so that only one file handle exists for these files at a time.
  Since only the compactor writes to generational files, it may be safe
  for it to open its own handles, but that is not currently implemented.

- All the *structure* of the database -- the by-id and by-seq trees,
  purge history, metadata, etc -- remains in the gen-0 file, that is,
  the new structure continues to be built in `db.couch.compact.data`.
  Only *data*, i.e. document bodies and attachments, is ever stored in a
  higher generation.

- If an attachment is currently stored in gen N, then it is copied into
  gen N+1. If it resides in a different non-zero generation, it remains
  where it is. If it resides in gen-0, and N > 0, then it is copied to
  `db.couch.compact.data`, since the original `db.couch` file will be
  discarded at the end of compaction.

- Document bodies follow the same rule, with one addition: if they
  contain any attachment pointers that have been moved by the previous
  rule, then a new copy of the document must be stored with updated
  attachment pointers. If the document is currently in gen N, then it is
  copied to gen N+1 with updated attachments. Otherwise, a fresh copy is
  written to its current generation -- either a generational file, or
  `db.couch.compact.data`.

- If N = M = 0, then doc/attachment data is copied from `db.couch` to
  `db.couch.compact.data`, rather than to `db.1.couch`. This means
  compaction continues to work as it currently does for existing
  databases.

- When compaction is complete, `db.couch.compact.data` is moved to
  `db.couch`. If N > 0 then `db.N.couch` is removed and reopened. Any
  live data it contained should now reside in `db.N+1.couch`. If N = M,
  then `db.M.couch.compact.maxgen` is moved to `db.M.couch`, and
  `gen_fds` reverts to its normal size.

- When N = M, i.e. we are compacting the max generation, the target
  generation will be the M+1 entry in `gen_fds`, but this file will
  eventually be moved to `db.M.couch`. Therefore we need to write
  pointers to this file's data with generation M, even though it is at
  position M+1 in `gen_fds` when it is being written to.
This adds a parameter named `gen` to the `PUT /db` and `POST
/db/_compact` endpoints. This sets the `max_generation` of the database
when it's created, and sets which generation to compact. The parameter
defaults to zero in both endpoints.
In order for smoosh to trigger compactions of generations above 0, we
need to store per-generation size information, rather than just storing
the total for all the shard's files.

The key changes are:

- `#full_doc_info.sizes` can now store a list of #size_info rather than
  a single record.

- `couch_db_updater:add_sizes()` uses the generation of the leaf pointer
  to build a list of #size_info, one for each generation. If there is
  only a single generation, then a single #size_info is returned, so
  that we continue to store a single #size_info record for
  non-generational databases and maximise backwards compatibility.

- In `couch_bt_engine`: `get_partition_info()` sums the sizes of each
  generation to return the total size of the partition shard;
  `split_sizes()` and `join_sizes()` can work on a list of #size_info as
  well as a single record; and `reduce_sizes()` can merge two lists of
  #size_info records.

- `couch_db_updater:flush_trees()` and
  `couch_bt_engine_compactor:copy_docs()` fold the attachment sizes
  into the active and external sizes when the end result is a
  multi-generation list of sizes.

- `couch_db:get_size_info()` returns a list of #size_info records. The
  first one is calculated for gen-0 as usual, i.e. the active size is
  obtained by adding all the tree sizes to the size of the stored
  data. For higher generations, the active size is just the size of
  the stored data.

- `fabric_db_info:merge_results()` continues to return a single object
  for the `sizes` for non-generational databases, but returns an array
  of per-generation size info for generational ones.

- `couch_db_updater:estimate_size()` sums the sizes of all generations
  to estimate the total size.
Now that we store per-generation size information, we can make smoosh
trigger compaction when any generation passes a channel's thresholds. We
achieve this by adjusting the events that smoosh reacts to, so that it
considers a specific generation for compaction:

- When the `updated` event occurs, enqueue the affected database at
  generation 0, since all new data is written to gen-0.

- In `couch_bt_engine:finish_compaction_int()`, we return the
  compaction's target generation in the result. In
  `couch_db_engine:finish_compaction()` we use this value to emit a
  `compacted_into_generation` event. This notifies smoosh that the
  target generation has gained new data and should be considered for
  compaction into the generation above it.

The generation is then fed into `find_channel()` and `get_priority()` so
that these functions examine the correct size information when deciding
whether to trigger compaction.

We also include the source generation in the compaction's "key" to
identify which generation of a DB is being compacted, so that it resumes
correctly from pausing or crashing.
@jcoglan jcoglan force-pushed the feat/generational-compaction branch from 29f855a to 09e16b3 Compare July 23, 2025 15:26