Blueprint Rendezvous Table Garbage Collection

# Background

Blueprints propagate some information to the database using rendezvous tables, as described in https://rfd.shared.oxide.computer/rfd/541.

This is implemented in `nexus/src/app/background/tasks/blueprint_rendezvous.rs` as a background task, which invokes `reconcile_blueprint_rendezvous_tables`.

In short, this operation is:

- Read latest blueprint
- Read latest inventory
- "Do Reconciliation", which means many calls to `INSERT INTO my_rendezvous_table`, tolerating conflicts with existing rows, either implicitly or by explicitly ignoring the conflict error.

As implemented today (1/5/2026) we are not performing "hard deletion" of any of these rendezvous table rows.

# The Issue

If we ever want to perform hard deletion of any rendezvous table rows (suppose they're extremely old, no longer relevant, taking up space, etc.), it's hard to guarantee that the background task won't re-insert old data.

For example, suppose we have the following sequence of events:

- Nexus A reads blueprint @ generation 1, gets ready to do reconciliation (which will insert into a rendezvous table with id = `foo-bar-baz`)
- Nexus A hangs unexpectedly. We'll treat this as a long sleep
- Nexus B creates, reads, and executes many more blueprints. The database row with id = `foo-bar-baz` is created, and later we want to hard delete it.
- At some point in the future, Nexus A will wake back up and resume reconciliation. It'll try to insert the entry with id = `foo-bar-baz`. If this row is hard-deleted, the later `INSERT` operation would succeed (which would be a bug - this would be a "zombie record" coming back to life unexpectedly after deletion).
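The sequence above can be replayed against a toy in-memory model. This is only a sketch: `RendezvousTable`, `insert_ignoring_conflict`, `hard_delete`, and `replay_zombie_sequence` are hypothetical names made up for illustration, and a `HashSet` stands in for the real database table.

```rust
use std::collections::HashSet;

/// Hypothetical in-memory stand-in for a rendezvous table, keyed by row id.
#[derive(Default)]
struct RendezvousTable {
    rows: HashSet<String>,
}

impl RendezvousTable {
    /// Models an `INSERT INTO` whose conflicts are ignored.
    /// Returns true only if a new row was actually written.
    fn insert_ignoring_conflict(&mut self, id: &str) -> bool {
        self.rows.insert(id.to_string())
    }

    /// Models a hard delete: the row is removed and no tombstone is left.
    /// Returns true if the row existed.
    fn hard_delete(&mut self, id: &str) -> bool {
        self.rows.remove(id)
    }
}

/// Replays the sequence of events and reports whether the row exists at the end.
fn replay_zombie_sequence() -> bool {
    let mut table = RendezvousTable::default();

    // Nexus A reads the blueprint at generation 1, decides what to insert,
    // and then stalls before touching the table.
    let stale_planned_id = "foo-bar-baz";

    // Nexus B executes newer blueprints: the row is created, then hard-deleted.
    table.insert_ignoring_conflict("foo-bar-baz");
    table.hard_delete("foo-bar-baz");

    // Nexus A wakes up and resumes. Nothing records that the row was deleted,
    // so the insert sees no conflict and succeeds.
    table.insert_ignoring_conflict(stale_planned_id);

    table.rows.contains("foo-bar-baz")
}
```

In this model `replay_zombie_sequence()` returns `true`: the hard-deleted row is back, because an unguarded insert cannot tell "never existed" apart from "existed and was deleted".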

# Resolution

There are a few ways we could mitigate this:

1. **Guard the INSERT**. Using a generation number stored somewhere, convert the `INSERT INTO` operations into a transaction or CTE, which is effectively `INSERT INTO ... + ONLY DO IT WHILE THE GENERATION NUMBER OF RENDEZVOUS IS LATEST`. This should cause all old rendezvous operations to be "locked out".
2. **Track ongoing rendezvous operations, Guard the DELETE**. Basically: If we know what rendezvous operations are going on, we could prevent the "hard deletion" from happening until we know it won't be revived.
3. **Rely on timeouts**. Use timeouts to justify the assumption that "no really old Nexuses exist". E.g., rendezvous operations could have a timeout of 30 sec, while we only perform hard deletions after 24 hours. (Note: this has huge issues for e.g. interactions with mupdate, time sync, so we'd *prefer* to avoid it, but listing it for completeness)
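As a rough illustration of option 1, here is a minimal in-memory sketch of a generation-guarded insert. Everything in it is an assumption: `GuardedTable`, `InsertOutcome`, `advance_generation`, and `guarded_insert` are hypothetical names, and where the generation number actually lives is still an open question. In the real system the check and the insert would have to happen atomically in one transaction or CTE; here the `&mut self` borrow stands in for that atomicity.

```rust
use std::collections::HashSet;

/// Result of a guarded insert attempt (hypothetical type).
#[derive(Debug, PartialEq)]
enum InsertOutcome {
    Inserted,
    AlreadyPresent,
    StaleGeneration,
}

/// Hypothetical model of a rendezvous table plus the latest generation
/// that has been used for reconciliation.
#[derive(Default)]
struct GuardedTable {
    latest_generation: u64,
    rows: HashSet<String>,
}

impl GuardedTable {
    /// Records that reconciliation has moved on to a newer generation.
    /// The stored generation never moves backwards.
    fn advance_generation(&mut self, generation: u64) {
        self.latest_generation = self.latest_generation.max(generation);
    }

    /// Inserts `id` only if the writer is working from the latest generation.
    /// Stale writers are rejected before they can touch any rows.
    fn guarded_insert(&mut self, id: &str, writer_generation: u64) -> InsertOutcome {
        if writer_generation < self.latest_generation {
            return InsertOutcome::StaleGeneration;
        }
        if self.rows.insert(id.to_string()) {
            InsertOutcome::Inserted
        } else {
            InsertOutcome::AlreadyPresent
        }
    }

    /// Models a hard delete; returns true if the row existed.
    fn hard_delete(&mut self, id: &str) -> bool {
        self.rows.remove(id)
    }
}
```

With this guard, a Nexus that read generation 1 and resumes after the table has advanced to a later generation gets `StaleGeneration` instead of reviving a hard-deleted row. The sketch does not address how the hard delete itself chooses which rows are safe to remove.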

(My personal preference is for option 1 - it does have slightly more boilerplate, but it seems most "clearly correct", and it tightly bounds how long old rendezvous operations can keep running)


With the ongoing work in https://github.com/oxidecomputer/omicron/pull/9552, this is relevant to fault management reconciliation as well. Frankly, it might be **more** relevant there, because alerts are probably going to churn more frequently than our blueprint rendezvous tables (e.g. dataset configurations).
