control plane & inferred schema improvements omnibus #2059

jgraettinger · 2025-04-09T18:02:05Z

Description:

This PR includes a number of structured commits which collectively:

Make JSON Schema reductions aware of x-collection-generation-id, including reset upon a ratchet upwards.
Fix inferred schema log sites to include x-collection-generation-id consistently, and to discard cached inferred doc::Shapes of collections which have been reset or backfilled.
Switch collections using schema inference to flow://relaxed-write-schema (now supported in the UI).
Deeply refactor how validations work:
- Previously, validation only considered a proposed "draft" state and would either succeed or error.
- Now, validation jointly walks live models, live built specifications, and the proposed draft change.
- In many cases validations will continue to error, however:
- In some cases, validations will apply automatic fixes to draft models to enforce a constraint instead of error-ing.
- This now gives us a single code path to apply nuanced and contextual handling of most constraint violations in the control plane.

Model fixes are included in publication_specs details and visible in the History view (though formatting could be better...).

Validations now fully supports recently-introduced spec fields such as inactive_bindings and inactive_transforms, and also honors collection reset semantics by performing an effective drop-and-replace.

This PR completes the control-plane and runtime implementation of changes from the collection evolution proposal.

One notable departure is that resource path pointers are de-emphasized and may soon be deprecated entirely. Instead, validations inline a /_meta/path property into each binding resource config which holds the last-Validated resource_path returned by the connector. This is sufficient for the new joins done by validations, but places fewer constraints or assumptions on how connectors compute binding resource paths.

A wide variety of scenarios were manually tested on a local stack, including:

Behaviors around resets of type-A and type-B collections bound to both captures and materializations, including bindings which are active and also disabled at the time of reset.
Behaviors around deletions of collections which had been actively bound to a capture & materialization.
Updates of inferred schemas, including around collection reset.

Workflow steps:

Many conditions which previous resulted in a validation error will now succeed, with additional changes made by the control plane and recorded in publication_specs.
Inferred schemas respect collection deletion / re-creation and reset.
Capture and materialization bindings will properly backfill upon collection reset, even if disabled during the reset and later enabled.

Documentation links affected:

(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)

Notes for reviewers:

(anything that might help someone review this PR)

This change is

jgraettinger · 2025-04-10T00:00:06Z

This PR includes both runtime and control-plane changes. I tested them together, but they can be released separately and in any order.

One other call-out is that the release of the inferred schema reducer update will churn a lot of inferred schemas in the platform, by adding x-collection-generation-id for the first time.

jgraettinger · 2025-04-10T00:12:11Z

Issue #1969
Issue #1919

Generation ID ratchets (only) upwards during reduction. If the RHS has a greater generation ID, then it replaces the LHS. Or, if the RHS has an older one, it's understood to be stale and is dropped. If they're equal, then reduce as per usual. If only one side has a valid (string) generation ID, then the values are reduced but the generation ID is explicitly attached to the reduced output, which handles historical schemas without a generation ID, as well as failing "safe" if there's a future whoopsie with regard to x-collection-generation-ID population.

We preserve inferred shapes across capture task terms, which is desireable to reduce the volumne of inferred schema update logs, but in doing so we must properly handle cases where a collection is reset (it's generation ID changes), or is backfilled. Use the partition template name AND the state key to key the cache of inferred shapes used across restarts. Also fix derivation inferred schema logging, which had previously omitted `x-collection-generation-id`.

This field has never been produced in actual logs. It's harmless that it's been in the rollup (always coalesced to the zero ID), but this commit cleans it up.

Rather than flow://write-schema

Validations have historically only looked at a proposed "draft" model, and concerned themselves with either accepting the draft and building specs, or producing errors. This change deeply reworks the fundamental "job" of validation. Going forward, it's job is to jointly examine live models, live specs, and proposed draft models to validate all required constraints, and to enforce those constraints by either a) producing an error, or b) performing an automatic fix of the model, recording the fact that it did so. Much of the lift here is in the coordination required to jointly step through live and drafted models and specs, which requires joining over extracted resource paths. Connector resource paths are somewhat complicated. After exhausting a number of other unwork-able options, the solution implemented here stashes a Validated response resource path into the resource config model of its applicable bindings. Then, the next time we're validating a proposed draft change of the now-live binding, we're able to extract those stashed paths in order to properly join over them. Having done this, new "fixes" are introduced for handling: - Unchanged bindings of collections which have been deleted. - Unchanged materialization exclusions of fields which have been dropped. - Unchanged projections of collections where the schema location has been removed. - Updating the inline inferred schema of a collection's read schema, respecting the generation ID of the collection and inferred schema. - Support for automatically back-filling the capture or materialization binding of a "reset" collection (having a changed generation ID). New validations which restrict changes to keys or logical partitions are also introduced, which may change only if the collection is also being reset. Finally, validations begin tracking inactive historical bindings / transforms of tasks, which are threaded into connector RPCs and are also used for low-level determination of whether binding backfill is required.

From the resource config schema. This approach doesn't work, existing many existing connectors have more complicated behaviors which cannot be expressed through resource path pointers.

Removes updating of inferred schemas from the agent publisher, as that is now fully handled by `validation`.

Updates controllers to include formatted model fixes as part of the publication `detail` in the history.

Updates the `validation` crate to consult `onIncompatibleSchemaChange` when materializations respond to collection resets. While technically a reset doesn't necessarily acompany an incompatible _schema_ change, in practice it frequently will. This is especially the case with inferred schemas, which get reset to the placeholder when the collection is reset. And perhaps more importantly, the use case for `onIncompatibleSchemaChange` is to allow users to prevent automatic backfills, which they would expect to work with collection resets. The case of `onIncompatibleSchemaChange: abort` will result in a validation error, just like it does for controllers. There's some further opportunities for refactoring, related to the `disableTask` value, but I'm kicking the can on that for the short term.

Updates the agent integration tests to use somewhat more realistic mocks of connector RPCs instead of using `NoOpConnectors`. Recent changes to `validation` have made it so `NoOpConnectors` will effectively prevent certain validations from being run, including materialization model updates in response to collection resets. This updates the publisher to add a `MakeConnectors` trait, which is used to supply the correct `Connectors` impl during tests vs runtime. A `TestConnectors` impl is introduced, which has bare-bones Validate impls that are sufficient for the current test scenarios. A future opportunity for refactoring would be to change how integration tests mock out connector validationn failures and unsatisfiable constraints. I'm leaving that as-is for now in order to reduce scope.

Updates the agent to allow collection keys and logical partitioning to change when the collection is being reset. Ideally, we'd remove this check entirely now that it's being handled by `validation`. But doing so would break the existing `incompatibleCollections`, which is used by the UI to display the schema evolution dialogue, and so I'm leaving it in for the time being.

Adds an integration test covering collection reset scenarios. Also updates the integration tests to account for the `x-collection-generation-id` being present in inferred schemas.

Support collection resets in evolutions, to avoid re-creating collections with new names. This change is purely additive, and doesn't remove support for the old renaming, since that's still in use by the UI. A new `reset` boolean was added, which the UI can start using in the future. If a collection is requested to be reset, then it will reset the collection and increment the backfill counters of all connected captures and materializations. Meaning this could be used to support something like an end-to-end dataflow reset.

Previously, materialization controllers would respond to deleted collections by disabling materialization bindings that reference them. This hands off responsibility for that to the `validation` crate instead of handling it explicitly in the controllers. There's no significant change in behavior from a user perspective, but this just removes some duplication, and also uncovers (and fixes) a sneaky bug in `tables::dependencies` with regard to deletions of materialization source captures. Most of the code here would likely be unnecessary if we move handling of materialization source captures into `validation`. This seems like probably a good idea, but I'm loath to add scope to an already massive change set, and that change would more naturally go along with other planned changes to source captures.

Fail builds when either a capture or materialization Validate response omits a `resource_path` for any binding. Resource paths are required in order to properly match up drafted and live bindings, and there's no way for us to properly validate a capture or materialization without them. This includes a bunch of test updates to populate empty resource configs and update snapshots to include the embedded resource paths.

Defer removal of live binding specs until later in the validation process, to ensure we correctly track inactive bindings in the face of binding and task disablement due to incompatible changes. If a binding is missing its resource path, communicate through scopes which binding(s) it is, so that connector developers can track down why (as this is a connector bug).

We've gone full circle on handling resource paths, and intend to stop using `resource_path_pointers` altogether. This updates the discover merge logic to use the resource paths embedded in the resource configs instead of extracting the paths using `resource_path_pointers`. The pointers will still be used as a fallback in the case that: - the embedded path is missing from the live spec bindings - the discover response does not include `resource_path` In either of those cases, we'll log a warning so that we can identify specs and connectors that need to be updated. The intent is to follow up later to remove `resource_path_pointers` and make it an error if either the live spec or the discover response is missing the resource path.

Supports the transition away from `resource_path_pointers` by allowing connectors to omit them from their spec response. We may still need `resource_path_pointers` avaialable after a connector was updated, since we use them as a fallback in case the live spec omits them. So in the case that a connector spec response omits them, we'll retain the existing value in the database.

psFried

LGTM

jgraettinger changed the title ~~Johnny/validation tweaks 2~~ control plane & schema improvements omnibus Apr 9, 2025

jgraettinger changed the title ~~control plane & schema improvements omnibus~~ control plane & inferred schema improvements omnibus Apr 9, 2025

jgraettinger requested a review from psFried April 9, 2025 23:59

jgraettinger marked this pull request as ready for review April 9, 2025 23:59

jgraettinger added change:significant This is a significant change change:planned This is a planned change labels Apr 10, 2025

williamhbaker mentioned this pull request Apr 25, 2025

materializations: handle transitions for bindings that are being re-enabled estuary/connectors#2722

Merged

psFried mentioned this pull request Apr 29, 2025

Phil/validation tweaks tweaks #2115

Merged

jgraettinger and others added 20 commits May 2, 2025 14:02

ops-catalog: remove spurious collection_generation_id field

505e1d6

This field has never been produced in actual logs. It's harmless that it's been in the rollup (always coalesced to the zero ID), but this commit cleans it up.

models: switch inference to use flow://relaxed-write-schema

de7e69f

Rather than flow://write-schema

agent: fix typo in update_l2_reporting diff generation

dc11a59

agent: revert extracting materialization resource path pointers

9433313

From the resource config schema. This approach doesn't work, existing many existing connectors have more complicated behaviors which cannot be expressed through resource path pointers.

agent: fix schema evolution test for new validations behavior

3a158b1

agent: include formatted model fixes in publication_details

64f5dda

agent: stop updating inferred schemas in publisher

27f43e1

Removes updating of inferred schemas from the agent publisher, as that is now fully handled by `validation`.

agent: account for model fixes in controller publication history

2742634

Updates controllers to include formatted model fixes as part of the publication `detail` in the history.

agent: add collection_resets integration test

bdb0f34

Adds an integration test covering collection reset scenarios. Also updates the integration tests to account for the `x-collection-generation-id` being present in inferred schemas.

psFried force-pushed the johnny/validation-tweaks-2 branch from 0b17848 to 805f5b6 Compare May 2, 2025 18:04

psFried approved these changes May 2, 2025

View reviewed changes

psFried merged commit 9fc2f52 into master May 2, 2025
5 checks passed

psFried deleted the johnny/validation-tweaks-2 branch May 2, 2025 19:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

control plane & inferred schema improvements omnibus #2059

control plane & inferred schema improvements omnibus #2059

Uh oh!

jgraettinger commented Apr 9, 2025 •

edited

Loading

Uh oh!

jgraettinger commented Apr 10, 2025 •

edited

Loading

Uh oh!

jgraettinger commented Apr 10, 2025

Uh oh!

psFried left a comment

Uh oh!

Uh oh!

Uh oh!

control plane & inferred schema improvements omnibus #2059

control plane & inferred schema improvements omnibus #2059

Uh oh!

Conversation

jgraettinger commented Apr 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jgraettinger commented Apr 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jgraettinger commented Apr 10, 2025

Uh oh!

psFried left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

jgraettinger commented Apr 9, 2025 •

edited

Loading

jgraettinger commented Apr 10, 2025 •

edited

Loading