Skip to content

control plane & inferred schema improvements omnibus #2059

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 21 commits into from
May 2, 2025

Conversation

jgraettinger
Copy link
Member

@jgraettinger jgraettinger commented Apr 9, 2025

Description:

This PR includes a number of structured commits which collectively:

  • Make JSON Schema reductions aware of x-collection-generation-id, including reset upon a ratchet upwards.
  • Fix inferred schema log sites to include x-collection-generation-id consistently, and to discard cached inferred doc::Shapes of collections which have been reset or backfilled.
  • Switch collections using schema inference to flow://relaxed-write-schema (now supported in the UI).
  • Deeply refactor how validations work:
    • Previously, validation only considered a proposed "draft" state and would either succeed or error.
    • Now, validation jointly walks live models, live built specifications, and the proposed draft change.
    • In many cases validations will continue to error, however:
    • In some cases, validations will apply automatic fixes to draft models to enforce a constraint instead of error-ing.
    • This now gives us a single code path to apply nuanced and contextual handling of most constraint violations in the control plane.

Model fixes are included in publication_specs details and visible in the History view (though formatting could be better...).

Validations now fully supports recently-introduced spec fields such as inactive_bindings and inactive_transforms, and also honors collection reset semantics by performing an effective drop-and-replace.

This PR completes the control-plane and runtime implementation of changes from the collection evolution proposal.

One notable departure is that resource path pointers are de-emphasized and may soon be deprecated entirely. Instead, validations inline a /_meta/path property into each binding resource config which holds the last-Validated resource_path returned by the connector. This is sufficient for the new joins done by validations, but places fewer constraints or assumptions on how connectors compute binding resource paths.

A wide variety of scenarios were manually tested on a local stack, including:

  • Behaviors around resets of type-A and type-B collections bound to both captures and materializations, including bindings which are active and also disabled at the time of reset.
  • Behaviors around deletions of collections which had been actively bound to a capture & materialization.
  • Updates of inferred schemas, including around collection reset.

Workflow steps:

  • Many conditions which previous resulted in a validation error will now succeed, with additional changes made by the control plane and recorded in publication_specs.
  • Inferred schemas respect collection deletion / re-creation and reset.
  • Capture and materialization bindings will properly backfill upon collection reset, even if disabled during the reset and later enabled.

Documentation links affected:

(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)

Notes for reviewers:

(anything that might help someone review this PR)


This change is Reviewable

@jgraettinger jgraettinger changed the title Johnny/validation tweaks 2 control plane & schema improvements omnibus Apr 9, 2025
@jgraettinger jgraettinger changed the title control plane & schema improvements omnibus control plane & inferred schema improvements omnibus Apr 9, 2025
@jgraettinger jgraettinger requested a review from psFried April 9, 2025 23:59
@jgraettinger jgraettinger marked this pull request as ready for review April 9, 2025 23:59
@jgraettinger
Copy link
Member Author

jgraettinger commented Apr 10, 2025

This PR includes both runtime and control-plane changes. I tested them together, but they can be released separately and in any order.

One other call-out is that the release of the inferred schema reducer update will churn a lot of inferred schemas in the platform, by adding x-collection-generation-id for the first time.

@jgraettinger jgraettinger added change:significant This is a significant change change:planned This is a planned change labels Apr 10, 2025
@jgraettinger
Copy link
Member Author

Issue #1969
Issue #1919

jgraettinger and others added 20 commits May 2, 2025 14:02
Generation ID ratchets (only) upwards during reduction. If the RHS has a
greater generation ID, then it replaces the LHS. Or, if the RHS has an
older one, it's understood to be stale and is dropped. If they're equal,
then reduce as per usual.

If only one side has a valid (string) generation ID, then the values are
reduced but the generation ID is explicitly attached to the reduced
output, which handles historical schemas without a generation ID, as
well as failing "safe" if there's a future whoopsie with regard to
x-collection-generation-ID population.
We preserve inferred shapes across capture task terms,
which is desireable to reduce the volumne of inferred schema update
logs, but in doing so we must properly handle cases where a collection
is reset (it's generation ID changes), or is backfilled.

Use the partition template name AND the state key to key the cache of
inferred shapes used across restarts.

Also fix derivation inferred schema logging, which had previously
omitted `x-collection-generation-id`.
This field has never been produced in actual logs.
It's harmless that it's been in the rollup (always coalesced
to the zero ID), but this commit cleans it up.
Validations have historically only looked at a proposed "draft" model,
and concerned themselves with either accepting the draft and building specs,
or producing errors.

This change deeply reworks the fundamental "job" of validation.

Going forward, it's job is to jointly examine live models, live specs,
and proposed draft models to validate all required constraints, and to
enforce those constraints by either a) producing an error, or b)
performing an automatic fix of the model, recording the fact that it
did so.

Much of the lift here is in the coordination required to jointly step
through live and drafted models and specs, which requires joining over
extracted resource paths.

Connector resource paths are somewhat complicated. After exhausting a
number of other unwork-able options, the solution implemented here
stashes a Validated response resource path into the resource config
model of its applicable bindings. Then, the next time we're validating a
proposed draft change of the now-live binding, we're able to extract
those stashed paths in order to properly join over them.

Having done this, new "fixes" are introduced for handling:
 - Unchanged bindings of collections which have been deleted.
 - Unchanged materialization exclusions of fields which have been
   dropped.
 - Unchanged projections of collections where the schema location has
   been removed.
 - Updating the inline inferred schema of a collection's read schema,
   respecting the generation ID of the collection and inferred schema.
 - Support for automatically back-filling the capture or materialization
   binding of a "reset" collection (having a changed generation ID).

New validations which restrict changes to keys or logical partitions are
also introduced, which may change only if the collection is also being
reset.

Finally, validations begin tracking inactive historical bindings /
transforms of tasks, which are threaded into connector RPCs and are also
used for low-level determination of whether binding backfill is required.
From the resource config schema. This approach doesn't work,
existing many existing connectors have more complicated behaviors which
cannot be expressed through resource path pointers.
Removes updating of inferred schemas from the agent publisher, as that is now
fully handled by `validation`.
Updates controllers to include formatted model fixes as part of the publication
`detail` in the history.
Updates the `validation` crate to consult `onIncompatibleSchemaChange` when
materializations respond to collection resets. While technically a reset
doesn't necessarily acompany an incompatible _schema_ change, in practice it
frequently will. This is especially the case with inferred schemas, which get
reset to the placeholder when the collection is reset. And perhaps more
importantly, the use case for `onIncompatibleSchemaChange` is to allow users to
prevent automatic backfills, which they would expect to work with collection
resets.

The case of `onIncompatibleSchemaChange: abort` will result in a validation
error, just like it does for controllers. There's some further opportunities
for refactoring, related to the `disableTask` value, but I'm kicking the can on
that for the short term.
Updates the agent integration tests to use somewhat more realistic mocks of
connector RPCs instead of using `NoOpConnectors`. Recent changes to
`validation` have made it so `NoOpConnectors` will effectively prevent certain
validations from being run, including materialization model updates in response
to collection resets. This updates the publisher to add a `MakeConnectors`
trait, which is used to supply the correct `Connectors` impl during tests vs
runtime. A `TestConnectors` impl is introduced, which has bare-bones Validate
impls that are sufficient for the current test scenarios.

A future opportunity for refactoring would be to change how integration tests
mock out connector validationn failures and unsatisfiable constraints.  I'm
leaving that as-is for now in order to reduce scope.
Updates the agent to allow collection keys and logical partitioning to change
when the collection is being reset.  Ideally, we'd remove this check entirely
now that it's being handled by `validation`. But doing so would break the
existing `incompatibleCollections`, which is used by the UI to display the
schema evolution dialogue, and so I'm leaving it in for the time being.
Adds an integration test covering collection reset scenarios.  Also updates the
integration tests to account for the `x-collection-generation-id` being present
in inferred schemas.
Support collection resets in evolutions, to avoid re-creating collections with
new names.  This change is purely additive, and doesn't remove support for the
old renaming, since that's still in use by the UI.  A new `reset` boolean was
added, which the UI can start using in the future. If a collection is requested
to be reset, then it will reset the collection and increment the backfill
counters of all connected captures and materializations. Meaning this could be
used to support something like an end-to-end dataflow reset.
Previously, materialization controllers would respond to deleted collections by
disabling materialization bindings that reference them.  This hands off
responsibility for that to the `validation` crate instead of handling it
explicitly in the controllers.  There's no significant change in behavior from
a user perspective, but this just removes some duplication, and also uncovers
(and fixes) a sneaky bug in `tables::dependencies` with regard to deletions of
materialization source captures.

Most of the code here would likely be unnecessary if we move handling of
materialization source captures into `validation`. This seems like probably a
good idea, but I'm loath to add scope to an already massive change set, and
that change would more naturally go along with other planned changes to source
captures.
Fail builds when either a capture or materialization Validate response omits a
`resource_path` for any binding.  Resource paths are required in order to
properly match up drafted and live bindings, and there's no way for us to
properly validate a capture or materialization without them.

This includes a bunch of test updates to populate empty resource configs and
update snapshots to include the embedded resource paths.
Defer removal of live binding specs until later in the validation
process, to ensure we correctly track inactive bindings in the face of
binding and task disablement due to incompatible changes.

If a binding is missing its resource path, communicate through scopes
which binding(s) it is, so that connector developers can track down why
(as this is a connector bug).
We've gone full circle on handling resource paths, and intend to stop using
`resource_path_pointers` altogether.  This updates the discover merge logic to
use the resource paths embedded in the resource configs instead of extracting
the paths using `resource_path_pointers`. The pointers will still be used as a
fallback in the case that:

- the embedded path is missing from the live spec bindings
- the discover response does not include `resource_path`

In either of those cases, we'll log a warning so that we can identify specs and
connectors that need to be updated. The intent is to follow up later to remove
`resource_path_pointers` and make it an error if either the live spec or the
discover response is missing the resource path.
Supports the transition away from `resource_path_pointers` by allowing
connectors to omit them from their spec response. We may still need
`resource_path_pointers` avaialable after a connector was updated, since
we use them as a fallback in case the live spec omits them. So in the case
that a connector spec response omits them, we'll retain the existing value
in the database.
@psFried psFried force-pushed the johnny/validation-tweaks-2 branch from 0b17848 to 805f5b6 Compare May 2, 2025 18:04
Copy link
Member

@psFried psFried left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@psFried psFried merged commit 9fc2f52 into master May 2, 2025
5 checks passed
@psFried psFried deleted the johnny/validation-tweaks-2 branch May 2, 2025 19:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
change:planned This is a planned change change:significant This is a significant change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants