-
Notifications
You must be signed in to change notification settings - Fork 66
control plane & inferred schema improvements omnibus #2059
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This PR includes both runtime and control-plane changes. I tested them together, but they can be released separately and in any order. One other call-out is that the release of the inferred schema reducer update will churn a lot of inferred schemas in the platform, by adding |
Generation ID ratchets (only) upwards during reduction. If the RHS has a greater generation ID, then it replaces the LHS. Or, if the RHS has an older one, it's understood to be stale and is dropped. If they're equal, then reduce as per usual. If only one side has a valid (string) generation ID, then the values are reduced but the generation ID is explicitly attached to the reduced output, which handles historical schemas without a generation ID, as well as failing "safe" if there's a future whoopsie with regard to x-collection-generation-ID population.
We preserve inferred shapes across capture task terms, which is desireable to reduce the volumne of inferred schema update logs, but in doing so we must properly handle cases where a collection is reset (it's generation ID changes), or is backfilled. Use the partition template name AND the state key to key the cache of inferred shapes used across restarts. Also fix derivation inferred schema logging, which had previously omitted `x-collection-generation-id`.
This field has never been produced in actual logs. It's harmless that it's been in the rollup (always coalesced to the zero ID), but this commit cleans it up.
Rather than flow://write-schema
Validations have historically only looked at a proposed "draft" model, and concerned themselves with either accepting the draft and building specs, or producing errors. This change deeply reworks the fundamental "job" of validation. Going forward, it's job is to jointly examine live models, live specs, and proposed draft models to validate all required constraints, and to enforce those constraints by either a) producing an error, or b) performing an automatic fix of the model, recording the fact that it did so. Much of the lift here is in the coordination required to jointly step through live and drafted models and specs, which requires joining over extracted resource paths. Connector resource paths are somewhat complicated. After exhausting a number of other unwork-able options, the solution implemented here stashes a Validated response resource path into the resource config model of its applicable bindings. Then, the next time we're validating a proposed draft change of the now-live binding, we're able to extract those stashed paths in order to properly join over them. Having done this, new "fixes" are introduced for handling: - Unchanged bindings of collections which have been deleted. - Unchanged materialization exclusions of fields which have been dropped. - Unchanged projections of collections where the schema location has been removed. - Updating the inline inferred schema of a collection's read schema, respecting the generation ID of the collection and inferred schema. - Support for automatically back-filling the capture or materialization binding of a "reset" collection (having a changed generation ID). New validations which restrict changes to keys or logical partitions are also introduced, which may change only if the collection is also being reset. Finally, validations begin tracking inactive historical bindings / transforms of tasks, which are threaded into connector RPCs and are also used for low-level determination of whether binding backfill is required.
From the resource config schema. This approach doesn't work, existing many existing connectors have more complicated behaviors which cannot be expressed through resource path pointers.
Removes updating of inferred schemas from the agent publisher, as that is now fully handled by `validation`.
Updates controllers to include formatted model fixes as part of the publication `detail` in the history.
Updates the `validation` crate to consult `onIncompatibleSchemaChange` when materializations respond to collection resets. While technically a reset doesn't necessarily acompany an incompatible _schema_ change, in practice it frequently will. This is especially the case with inferred schemas, which get reset to the placeholder when the collection is reset. And perhaps more importantly, the use case for `onIncompatibleSchemaChange` is to allow users to prevent automatic backfills, which they would expect to work with collection resets. The case of `onIncompatibleSchemaChange: abort` will result in a validation error, just like it does for controllers. There's some further opportunities for refactoring, related to the `disableTask` value, but I'm kicking the can on that for the short term.
Updates the agent integration tests to use somewhat more realistic mocks of connector RPCs instead of using `NoOpConnectors`. Recent changes to `validation` have made it so `NoOpConnectors` will effectively prevent certain validations from being run, including materialization model updates in response to collection resets. This updates the publisher to add a `MakeConnectors` trait, which is used to supply the correct `Connectors` impl during tests vs runtime. A `TestConnectors` impl is introduced, which has bare-bones Validate impls that are sufficient for the current test scenarios. A future opportunity for refactoring would be to change how integration tests mock out connector validationn failures and unsatisfiable constraints. I'm leaving that as-is for now in order to reduce scope.
Updates the agent to allow collection keys and logical partitioning to change when the collection is being reset. Ideally, we'd remove this check entirely now that it's being handled by `validation`. But doing so would break the existing `incompatibleCollections`, which is used by the UI to display the schema evolution dialogue, and so I'm leaving it in for the time being.
Adds an integration test covering collection reset scenarios. Also updates the integration tests to account for the `x-collection-generation-id` being present in inferred schemas.
Support collection resets in evolutions, to avoid re-creating collections with new names. This change is purely additive, and doesn't remove support for the old renaming, since that's still in use by the UI. A new `reset` boolean was added, which the UI can start using in the future. If a collection is requested to be reset, then it will reset the collection and increment the backfill counters of all connected captures and materializations. Meaning this could be used to support something like an end-to-end dataflow reset.
Previously, materialization controllers would respond to deleted collections by disabling materialization bindings that reference them. This hands off responsibility for that to the `validation` crate instead of handling it explicitly in the controllers. There's no significant change in behavior from a user perspective, but this just removes some duplication, and also uncovers (and fixes) a sneaky bug in `tables::dependencies` with regard to deletions of materialization source captures. Most of the code here would likely be unnecessary if we move handling of materialization source captures into `validation`. This seems like probably a good idea, but I'm loath to add scope to an already massive change set, and that change would more naturally go along with other planned changes to source captures.
Fail builds when either a capture or materialization Validate response omits a `resource_path` for any binding. Resource paths are required in order to properly match up drafted and live bindings, and there's no way for us to properly validate a capture or materialization without them. This includes a bunch of test updates to populate empty resource configs and update snapshots to include the embedded resource paths.
Defer removal of live binding specs until later in the validation process, to ensure we correctly track inactive bindings in the face of binding and task disablement due to incompatible changes. If a binding is missing its resource path, communicate through scopes which binding(s) it is, so that connector developers can track down why (as this is a connector bug).
We've gone full circle on handling resource paths, and intend to stop using `resource_path_pointers` altogether. This updates the discover merge logic to use the resource paths embedded in the resource configs instead of extracting the paths using `resource_path_pointers`. The pointers will still be used as a fallback in the case that: - the embedded path is missing from the live spec bindings - the discover response does not include `resource_path` In either of those cases, we'll log a warning so that we can identify specs and connectors that need to be updated. The intent is to follow up later to remove `resource_path_pointers` and make it an error if either the live spec or the discover response is missing the resource path.
Supports the transition away from `resource_path_pointers` by allowing connectors to omit them from their spec response. We may still need `resource_path_pointers` avaialable after a connector was updated, since we use them as a fallback in case the live spec omits them. So in the case that a connector spec response omits them, we'll retain the existing value in the database.
0b17848
to
805f5b6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Description:
This PR includes a number of structured commits which collectively:
x-collection-generation-id
, including reset upon a ratchet upwards.x-collection-generation-id
consistently, and to discard cached inferred doc::Shapes of collections which have been reset or backfilled.flow://relaxed-write-schema
(now supported in the UI).Model fixes are included in
publication_specs
details and visible in the History view (though formatting could be better...).Validations now fully supports recently-introduced spec fields such as
inactive_bindings
andinactive_transforms
, and also honors collectionreset
semantics by performing an effective drop-and-replace.This PR completes the control-plane and runtime implementation of changes from the collection evolution proposal.
One notable departure is that resource path pointers are de-emphasized and may soon be deprecated entirely. Instead, validations inline a
/_meta/path
property into each binding resource config which holds the last-Validated resource_path returned by the connector. This is sufficient for the new joins done by validations, but places fewer constraints or assumptions on how connectors compute binding resource paths.A wide variety of scenarios were manually tested on a local stack, including:
Workflow steps:
publication_specs
.reset
.Documentation links affected:
(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)
Notes for reviewers:
(anything that might help someone review this PR)
This change is