[ADAP-369] [Feature] Include all bytes processed in the adapter response #595
Comments
Thanks for opening this @abuckenheimer! I agree with your assessment that this happens because the adapter response only returns the bytes processed for the last step; it does not consider each query that executed as a result of running the model. This is problematic when it appears to the user that everything is being counted. Although this may be surprising to many dbt-bigquery users, the current behavior is what we expect, so I relabeled this as a feature request / enhancement and re-titled it accordingly.

We could imagine a scenario where dbt accumulates the bytes processed over multiple statements and then displays the total. This could be a useful proxy for how much it cost to run a particular model. There are a few things that give me pause.
But my dominating thought is that most cost is going to be incurred once a project is in production, and it seems crucial that teams have a way to monitor their costs outside of dbt's CLI output. I'd rather analytics engineers be comfortable seeing their all-in costs (including consumption by end users like BI tools) than relying on the reported bytes during model development. Interested to hear your feedback -- am I overestimating the role of comprehensive cost observability (outside of the dbt output)?
I agree I don't expect dbt console reporting to be comprehensive, but it's the most practical feedback loop you can have while developing. We ask our developers to come up with an estimate of cost in PRs when introducing new models or making changes, and seeing the GB processed in the console is a really great starting point for telling us whether cost is going up and, if so, by how much. We know this is just an estimate and misses the nuance you might get from running a model over a period of time and observing client read patterns, but at least it tells you within an order of magnitude what something might cost, which is most of what matters in evaluating changes. Now that this just reports 0, you can't even start a discussion without bringing in other tools.
I hear you about supporting a practical feedback loop, and how the optimization in dbt-labs/dbt-bigquery#77 removed some information that was there previously (however imperfect it was). Although we aren't able to prioritize it ourselves at this time, we'd welcome a PR from a community member that provides a targeted, best-effort roll-up of bytes processed on a per-model basis. Labeling this accordingly.
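For reference, here is a minimal sketch of what such a per-model roll-up could look like, written against the google-cloud-bigquery client directly rather than dbt-bigquery's internals; the `run_query_and_track` helper and the per-model dictionary are purely illustrative and not part of any existing dbt API:

```python
# Illustrative sketch only -- not dbt-bigquery's actual implementation.
# Assumes GCP credentials are configured for google-cloud-bigquery.
from collections import defaultdict
from google.cloud import bigquery

client = bigquery.Client()
bytes_by_model: dict[str, int] = defaultdict(int)

def run_query_and_track(model_name: str, sql: str):
    """Run one statement for a model and add its billed bytes to a per-model total."""
    job = client.query(sql)
    result = job.result()  # wait for completion so job statistics are populated
    bytes_by_model[model_name] += job.total_bytes_processed or 0
    return result

# An incremental / copy_partitions run issues several statements; tracking each one
# means the final roll-up reflects the whole model, not just the last step.
# run_query_and_track("my_model", create_temp_table_sql)
# run_query_and_track("my_model", merge_sql)
# print(bytes_by_model["my_model"])
```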
## Summary & Motivation

Reverts #19646 and #19498. Although we were getting more metadata from dbt, this metadata was:

- Prone to inaccuracy (e.g. see https://github.com/dbt-labs/dbt-bigquery/issues/602)
- Causing pipelines to potentially double in duration, due to the abundance of logs

This metadata is easy to get if we introduce a wrapper (and then metadata retrieval will sit squarely in the ownership of this integration). With that in mind, we don't need to rely on the metadata from `--debug`, so remove it.

## How I Tested These Changes

pytest
These two issues also asked for all bytes processed to be included in the adapter response, and I closed each of them in favor of this issue.
This is an earlier issue report which is related to this request:
I don't have the opportunity to submit a PR for this right now, but just chiming in as I am the original poster/opener of dbt-labs/dbt-bigquery#775. Understanding the practical challenges of doing so, I am still very interested in seeing this functionality included.
I believe this affects dbt packages like elementary as well.
Is this a regression in a recent version of dbt-bigquery?
Current Behavior
Hey folks, this is a bit of a convoluted "regression", so I'm happy to mark this as something else, but following the merge of dbt-labs/dbt-bigquery#77 I noticed that the `bytes_processed` returned in the adapter response for models with `copy_partitions: true` is now 0. This kind of makes sense (BigQuery partition copies are free, so 0 bytes is technically correct), but it's ultimately unexpected because the construction of the temporary table is not free.

This is indicative of a larger problem with the adapter response: it only returns the bytes processed for the last step and does not consider every part of running the model. You can see this even with `copy_partitions: false`, where dbt under-reports the cost of running incremental models by up to half, because BigQuery will charge you both for the creation of the temporary table and then again for reading that temporary table in the merge query into the destination.
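To make the double charge concrete, here is a minimal sketch using the google-cloud-bigquery client; the dataset, table, and column names are hypothetical and not taken from the original report. Each statement in an incremental run reports its own billed bytes, so surfacing only the last one hides part of the cost:

```python
# Illustrative sketch only -- dataset/table/column names are hypothetical.
# Both the temp-table build and the merge report billed bytes on BigQuery.
from google.cloud import bigquery

client = bigquery.Client()

statements = {
    "build temp table": """
        CREATE OR REPLACE TABLE my_dataset.my_model__dbt_tmp AS
        SELECT id, payload
        FROM my_dataset.source_events
        WHERE event_date = CURRENT_DATE()
    """,
    "merge into target": """
        MERGE my_dataset.my_model AS t
        USING my_dataset.my_model__dbt_tmp AS s
        ON t.id = s.id
        WHEN MATCHED THEN UPDATE SET payload = s.payload
        WHEN NOT MATCHED THEN INSERT ROW
    """,
}

total_bytes = 0
for step, sql in statements.items():
    job = client.query(sql)
    job.result()  # wait so job statistics are populated
    print(f"{step}: {job.total_bytes_processed or 0} bytes processed")
    total_bytes += job.total_bytes_processed or 0

# The adapter response currently surfaces only the last statement's bytes;
# total_bytes is closer to what the model actually cost to run.
print(f"total: {total_bytes} bytes processed")
```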
Expected/Previous Behavior
Given an incremental model configured with `copy_partitions: true` as described above, on a dbt run I'd expect the reported bytes processed to equal the bytes that would be processed by running the compiled `target/compiled/.../model_name.sql` file in the BigQuery console; instead, the adapter response reports 0 bytes processed.
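One way to check that expectation independently of dbt is a BigQuery dry run against the compiled SQL. A minimal sketch with the google-cloud-bigquery client follows; the file path is illustrative and should point at your own compiled model:

```python
# Minimal sketch: estimate the bytes a compiled model would process via a dry run.
# The file path is illustrative; point it at your own target/compiled/... model file.
from google.cloud import bigquery

client = bigquery.Client()

with open("target/compiled/my_project/models/model_name.sql") as f:
    compiled_sql = f.read()

job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(compiled_sql, job_config=job_config)

# For a dry run, total_bytes_processed is the estimate shown in the BigQuery console.
print(f"Estimated bytes processed: {job.total_bytes_processed:,}")
```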
Steps To Reproduce

See above.
Relevant log output
No response
Environment
Additional Context
No response