-
Notifications
You must be signed in to change notification settings - Fork 48
Workflow streams #31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Workflow streams #31
Conversation
|
Exciting! |
|
@aaronsteers Now just gotta work around the rate limiting 🙃 |
| def __init__(self, *args, **kwargs): | ||
| super().__init__(*args, **kwargs) | ||
| self._schema_emitted = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aaronsteers I also made schema messages a bit quieter for jobs. It was outputting one every time a new parent workflow run was synced, and some targets flush records when they receive a new schema for an existing table, which would cause the target to write lots of really small batches and affect performance.
Maybe this makes sense as a builtin option in the SDK?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @edgarrmondragon - We certainly did see target performance issues due to flushing every time a schema message was received in target-athena (and other community-created targets). These were addressed at the target level within the SDK, adding a schema diff check before draining records, and other changes to just reduce the frequency of those operations.
And yes, I do think something like a _schema_emitted tracker is good to add in the SDK at the tap level also. However, we may need to check schema messages against each other in case the schema has indeed been changed since the last iteration, and also this may need to be coordinated across multiple instances of stream objects having the same type. (For parent-child streams, where this is the biggest issue, I believe the next child stream instance may be a brand new object of the same type.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aaronsteers I agree that just ignoring repeated schemas is not ideal, there may be some evolution from one to the next 👍.
For parent-child streams, where this is the biggest issue, I believe the next child stream instance may be a brand new object of the same type.
For the built-in parent-child functionality, child streams are actually a single instance created at Tap.discover_streams, but synced with multiple contexts so having _schema_emitted at the class level works.
|
@edgarrmondragon is this still in progress? What's blocking? |
@ericboucher thanks for the ping. I had forgotten about this one but last time tests were failing due to rate limits. I just rebased and if tests are ✅ , this might be ready for review 😄 |
|
SonarCloud Quality Gate failed.
|
|
Looks good to me! Same blocker as #93 for SonarCloud tho... |
|
@edgarrmondragon I think you can go ahead and merge |








New streams for GitHub Actions :
workflows: All workflows in a repository.workflow_runs: All workflow runs in a repository.workflow_run_jobs: All jobs executed in a workflow run. For example, every value in a matrix corresponds to a job.