
Generational compaction #5583


Draft
jcoglan wants to merge 10 commits into main from feat/generational-compaction

Conversation

jcoglan
Contributor

@jcoglan jcoglan commented Jul 2, 2025

(This is a draft we're opening for discussion. The bulk of required information on design background, analysis and implementation is in the commits, including some design docs added to the repo. We will flesh this PR out as the feature gets closer to being ready.)

Overview

This PR implements a "generational" storage model in couch_bt_engine, which @janl and I have been working on. Its aim is to improve the performance of compaction on large databases with seldom-changing documents, where every compaction run currently has to copy a mostly-unchanged set of data into the new file.

The generational model splits a shard's data storage into multiple generations, where the usual db.couch file is "generation 0". On compaction, live data in this file is promoted into generation 1. The next time generation 0 is compacted, it does not have to copy the same set of data again, as much of it will have been moved to another file.

Further detail on the design and analysis is in design docs we have committed to the repo; see https://github.com/neighbourhoodie/couchdb/blob/feat/generational-compaction/src/couch/doc/generational-compaction. The commit messages give further details about the implementation.

Open questions

  • What tests need to be added to adequately cover this functionality and make sure there is no risk of data loss?
  • Since compaction is now parameterised by a generation, how do we ensure that only one compaction runs per shard at a time? That is, generations 1 and 2 of the same shard must never compact concurrently, since this would break our consistency assumptions.

Testing recommendations

Related Issues or Pull Requests

Checklist

  • Code is written and works correctly
  • Changes are covered by tests
  • Any new configurable parameters are documented in rel/overlay/etc/default.ini
  • Documentation changes were made in the src/docs folder
  • Documentation changes were backported (separated PR) to affected branches

@jcoglan jcoglan force-pushed the feat/generational-compaction branch 3 times, most recently from f0666ca to 29f855a Compare July 8, 2025 14:12
jcoglan added 10 commits July 23, 2025 15:36
To support a generational storage model, the #st struct needs to have
multiple file handles open. Whereas we currently back a shard with a
single file, `db.suffix.couch`, the generational model will augment this
with a set of "generation" files named `db.1.suffix.couch`,
`db.2.suffix.couch`, etc. The original `db.suffix.couch` file is
henceforth referred to as "gen-0".

Each of these file handles needs to be monitored by the incref/decref
functions and so we replace the `fd` and `fd_monitor` fields with a pair
of `{fd, monitor}` stored in the `fd` field. The new `gen_fds` field
stores a list of such pairs, and points at the `db.{1,2,...}.couch`
files.

The number of generational files opened is determined by a new field in
the DB header named `max_generation`. This defaults to 0 so that all
existing databases stay on the current storage model, and need to opt in
to using generational storage.

Here we also add a set of functions that the engine and compactor will
need for managing generational files:

- `generation_file_path()`: returns the path to the Nth generation file;
  returns the normal `db.suffix.couch` path for gen-0.

- `open_generation_file()`: opens and monitors the Nth generation file.

- `open_generation_files()`: opens and monitors all the files for
  generations from 1 to N.

- `maybe_open_generation_files()`: opens and monitors all the generation
  files except if the `compacting` option is set; the compactor does not
  need to re-open the generation files as it will share the existing
  handles with the engine (i.e. we don't open multiple handles to the
  same file).

- `open_additional_generation_file()`: when compacting the highest
  generation, we will open an extra temporary file for its live data to
  be moved into; if `max_generation` = M then this causes `gen_fds` to
  contain M+1 file handles.

- `reopen_generation_file()`: once the file `db.N.couch` has been
  compacted into `db.N+1.couch`, this function will remove and reopen
  the existing `db.N.couch` file so that it becomes empty.

- `delete_generational_files()`: when deleting the database, this
  removes all the generational files.

- `get_fd()`: returns the file handle for the Nth generation, or the
  original gen-0 `db.suffix.couch` file.
In the generational storage model, all new docs/revs continue to be
written to "gen-0", the `db.suffix.couch` file. On compaction, live data
is "promoted" to the next generation; data in `db.couch` is moved to
`db.1.couch`, data in `db.1.couch` to `db.2.couch`, etc. Therefore, doc
body and attachment pointers need to include a representation of which
file they reside in.

This is accomplished by storing a pair of `{Gen, Ptr}` instead of just
`Ptr` when a body/attachment is written to generation 1 or above. When
writing to gen-0, we continue to just store the pointer, rather than
wrapping it in `{0, Ptr}`. This means that we continue to write
backwards-compatible data for databases that have not opted in to
generational storage, and it makes sure we can continue to read existing
data, as pointers stored in gen-0 look the same as they always have.
This commit implements the generational compaction scheme wherein live
data is "promoted" to a higher generation by the compactor. Each
compaction run targets a specific generation N, from 0 up to the
database's maximum generation M. If a database has gen-0 file
`db.couch`, then the compactor works as follows:

- The compactor still creates `db.couch.compact.data` and
  `db.couch.compact.meta` files. If N = M then it also opens the file
  `db.M.couch.compact.maxgen`, and this file is added to the end of
  `gen_fds`, creating a temporary generation M+1 file.

- The compactor shares the `gen_fds` file handles with the main DB
  engine, so that only one file handle exists for these files at a time.
  Since only the compactor writes to generational files, it may be safe
  for it to open its own handles, but that is not currently implemented.

- All the *structure* of the database -- the by-id and by-seq trees,
  purge history, metadata, etc -- remains in the gen-0 file, that is,
  the new structure continues to be built in `db.couch.compact.data`.
  Only *data*, i.e. document bodies and attachments, is ever stored in a
  higher generation.

- If an attachment is currently stored in gen N, then it is copied into
  gen N+1. If it resides in a different non-zero generation, it remains
  where it is. If it resides in gen-0, and N > 0, then it is copied to
  `db.couch.compact.data`, since the original `db.couch` file will be
  discarded at the end of compaction.

- Document bodies follow the same rule, with one addition: if they
  contain any attachment pointers that have been moved by the previous
  rule, then a new copy of the document must be stored with updated
  attachment pointers. If the document is currently in gen N, then it is
  copied to gen N+1 with updated attachments. Otherwise, a fresh copy is
  written to its current generation -- either a generational file, or
  `db.couch.compact.data`.

- If N = M = 0, then doc/attachment data is copied from `db.couch` to
  `db.couch.compact.data`, rather than to `db.1.couch`. This means
  compaction continues to work as it currently does for existing
  databases.

- When compaction is complete, `db.couch.compact.data` is moved to
  `db.couch`. If N > 0 then `db.N.couch` is removed and reopened. Any
  live data it contained should now reside in `db.N+1.couch`. If N = M,
  then `db.M.couch.compact.maxgen` is moved to `db.M.couch`, and
  `gen_fds` reverts to its normal size.

- When N = M, i.e. we are compacting the max generation, the target
  generation will be the M+1 entry in `gen_fds`, but this file will
  eventually be moved to `db.M.couch`. Therefore we need to write
  pointers to this file's data with generation M, even though it is at
  position M+1 in `gen_fds` when it is being written to.
This adds a parameter named `gen` to the `PUT /db` and `POST
/db/_compact` endpoints. This sets the `max_generation` of the database
when it's created, and sets which generation to compact. The parameter
defaults to zero in both endpoints.
In order for smoosh to trigger compactions of generations above 0, we
need to store per-generation size information, rather than just storing
the total for all the shard's files.

The key changes are:

- `#full_doc_info.sizes` can now store a list of #size_info rather than
  a single record.

- `couch_db_updater:add_sizes()` uses the generation of the leaf pointer
  to build a list of #size_info, one for each generation. If there is
  only a single generation, then a single #size_info is returned, so
  that we continue to store a single #size_info record for
  non-generational databases and maximise backwards compatibility.

- In `couch_bt_engine`: `get_partition_info()` sums the sizes of each
  generation to return the total size of the partition shard;
  `split_sizes()` and `join_sizes()` can work on a list of #size_info as
  well as a single record; and `reduce_sizes()` can merge two lists of
  #size_info records.

- `couch_db_updater:flush_trees()` and
  `couch_bt_engine_compactor:copy_docs()` fold the attachment sizes
  into the active and external sizes when the end result is a
  multi-generation list of sizes.

- `couch_db:get_size_info()` returns a list of #size_info records. The
  first one is calculated for gen-0 as usual, i.e. the active size is
  obtained by adding all the tree sizes to the size of the stored
  data. For higher generations, the active size is just the size of
  the stored data.

- `fabric_db_info:merge_results()` continues to return a single object
  for the `sizes` for non-generational databases, but returns an array
  of per-generation size info for generational ones.

- `couch_db_updater:estimate_size()` sums the sizes of all generations
  to estimate the total size.
Now that we store per-generation size information, we can make smoosh
trigger compaction when any generation passes a channel's thresholds. We
achieve this by adjusting the events that smoosh reacts to, so that it
considers a specific generation for compaction:

- When the `updated` event occurs, enqueue the affected database at
  generation 0, since all new data is written to gen-0.

- In `couch_bt_engine:finish_compaction_int()`, we return the
  compaction's target generation in the result. In
  `couch_db_engine:finish_compaction()` we use this value to emit a
  `compacted_into_generation` event. This notifies smoosh that the
  target generation has gained new data and should be considered for
  compaction into the generation above it.

The generation is then fed into `find_channel()` and `get_priority()` so
that these functions examine the correct size information when deciding
whether to trigger compaction.

We also include the source generation in the compaction's "key" to
identify which generation of a DB is being compacted, so that it resumes
correctly from pausing or crashing.
@jcoglan jcoglan force-pushed the feat/generational-compaction branch from 29f855a to 09e16b3 Compare July 23, 2025 15:26