
[ADAP-369] [Feature] Include all bytes processed in the adapter response #595

Open
2 tasks done
abuckenheimer opened this issue Mar 14, 2023 · 7 comments
Labels
feature:cost-reduction Issues related to measuring and reducing execution cost help-wanted Extra attention is needed pkg:dbt-bigquery Issue affects dbt-bigquery type:enhancement New feature request

Comments

@abuckenheimer

Is this a regression in a recent version of dbt-bigquery?

  • I believe this is a regression in dbt-bigquery functionality
  • I have searched the existing issues, and I could not find an existing issue for this regression

Current Behavior

Hey folks, this is a bit of a convoluted "regression", so I'm happy to mark this as something else, but following the merge of dbt-labs/dbt-bigquery#77 I noticed that the bytes_processed returned in the adapter response for models with copy_partitions: true is now 0. This makes some sense: BigQuery copies are free, so 0 bytes is technically correct, but it's ultimately unexpected because the construction of the temporary table is not free.

This is indicative of a larger problem with the adapter response: it only returns the bytes processed for the last step, and it does not consider every part of running the model. You can see this when copy_partitions: false, where dbt under-reports the cost of running incremental models by up to half, because BigQuery will charge you both for the creation of the temporary table and again for reading the temporary table during the merge query into the destination.
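To make the under-reporting concrete, here is a small illustration with made-up byte counts (not real measurements) of the two billable steps described above:

```python
# Illustration with made-up numbers: without copy_partitions, an
# incremental run executes two billable BigQuery queries, but the
# adapter response only surfaces the last one.
create_temp_bytes = 5_000_000_000  # step 1: build the temporary table
merge_bytes = 5_000_000_000        # step 2: read the temp table for the MERGE

billed_bytes = create_temp_bytes + merge_bytes  # what BigQuery charges for
reported_bytes = merge_bytes                    # what dbt currently reports

print(reported_bytes / billed_bytes)  # 0.5, i.e. up to half the cost is hidden
```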

Expected/Previous Behavior

given a similar config to:

model_name:
  +materialized: incremental
  +partition_by:
    copy_partitions: true
    field: timestamp
    data_type: timestamp
    granularity: day

On a dbt run I'd expect to see the bytes processed be equal to the amount of bytes that would be processed by running the target/compiled/.../model_name.sql file in the bigquery console, instead I see:

19:20:57  1 of 1 OK created sql incremental model dataset.model_name................ [None (0 processed) in 166.72s]

Steps To Reproduce

see above

Relevant log output

No response

Environment

- OS:Ubuntu 18.04.6 LTS (Bionic Beaver)
- Python: 3.11.2
- dbt-core (working version): N/A
- dbt-bigquery (working version): N/A 
- dbt-core (regression version): 1.5.0-b3
- dbt-bigquery (regression version): 1.5.0b2

Additional Context

No response

@abuckenheimer abuckenheimer added type:bug Something isn't working as documented type:regression Something used to work and is no longer working triage:product In Product's queue labels Mar 14, 2023
@github-actions github-actions bot changed the title [Regression] 0 bytes processed in adapter response [ADAP-369] [Regression] 0 bytes processed in adapter response Mar 14, 2023
@dbeatty10 dbeatty10 added type:enhancement New feature request and removed type:bug Something isn't working as documented type:regression Something used to work and is no longer working labels Mar 15, 2023
@dbeatty10 dbeatty10 self-assigned this Mar 15, 2023
@dbeatty10
Contributor

Thanks for opening this @abuckenheimer !

Agree with your assessment that this is because the adapter response only returns the bytes processed for the last step, and it does not consider each query that executed as a result of running the model. This is problematic when it appears to the user that it is counting everything.

Although this may be surprising to many dbt-bigquery users, the current behavior is what we expect, so I relabeled this as a feature request / enhancement. I also re-titled accordingly.

We could imagine some scenario where dbt was able to accumulate the bytes processed over multiple statements and then display them. This could be a useful proxy for how much it cost to run a particular model.

There are a few main things that give me pause, though:

  1. The output emitted by dbt doesn't feel like the best place for comprehensive cost tracking and management
  2. Not all cost-generating activity is likely to be easily captured and assigned to an individual dbt model -- think API calls or necessary metadata queries.
  3. We typically prefer to expose features that are supported on multiple data platforms (e.g., not just a single platform like BigQuery)

bytes_processed in dbt-bigquery is already an exception to the norm of only exposing features supported by multiple warehouses, and we could push further down that path by trying to do an accumulation / roll-up of the reported value.
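A minimal sketch of what such a roll-up might look like (this is hypothetical, not dbt's actual implementation): accumulate bytes_processed across every BigQuery job a model runs, rather than reporting only the last job's value.

```python
# Hypothetical sketch of a per-model bytes_processed roll-up.
# Class and method names are illustrative, not part of dbt's API.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ModelRunStats:
    job_bytes: List[int] = field(default_factory=list)

    def record_job(self, bytes_processed: Optional[int]) -> None:
        # BigQuery reports None/0 for free operations such as partition
        # copies; count them as 0 so they still appear in the roll-up.
        self.job_bytes.append(bytes_processed or 0)

    @property
    def total_bytes_processed(self) -> int:
        return sum(self.job_bytes)

stats = ModelRunStats()
stats.record_job(1_250_000)  # create the temporary table
stats.record_job(1_250_000)  # merge the temp table into the destination
stats.record_job(None)       # partition copy (free, reported as None)
print(stats.total_bytes_processed)  # 2500000
```

The adapter response would then carry the accumulated total instead of only the final statement's figure.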

But my dominating thought is that most cost is going to be incurred once a project is in production, and it seems crucial that teams have a method to monitor their costs outside of the CLI output of dbt. I'd rather analytics engineers be comfortable seeing their all-in costs (including consumption by end users like BI) than relying on the reported bytes during model development.

Interested to hear your feedback -- am I overestimating the role of comprehensive cost observability (outside of the dbt output)?

@dbeatty10 dbeatty10 removed their assignment Mar 15, 2023
@dbeatty10 dbeatty10 added triage:awaiting-response Awaiting a response from the reporter and removed triage:product In Product's queue labels Mar 15, 2023
@dbeatty10 dbeatty10 changed the title [ADAP-369] [Regression] 0 bytes processed in adapter response [ADAP-369] [Feature] Include all bytes processed in the adapter response Mar 15, 2023
@abuckenheimer
Author

I agree I don't expect dbt console reporting to be comprehensive, but it's the most practical feedback loop you can have while developing. We ask our developers to come up with a cost estimate in PRs when introducing new models or making changes, and seeing the GB processed in the console is a really great starting point for telling us whether cost is going up and, if so, by how much. We know this is just an estimate and misses the nuance you might get from running a model over a period of time and observing client read patterns, but at least it tells you within an order of magnitude what something might cost, which is most of what matters in evaluating changes. Now that this just reports 0, you can't even start a discussion without bringing in other tools.

@github-actions github-actions bot added triage:product In Product's queue and removed triage:awaiting-response Awaiting a response from the reporter labels Mar 15, 2023
@dbeatty10
Contributor

I hear you about supporting a practical feedback loop and how the optimization in dbt-labs/dbt-bigquery#77 removed some information that was there previously (however imperfect it was).

Although we aren't able to prioritize it ourselves at this time, we'd welcome a PR from a community member that provides a targeted, best-effort roll-up of bytes processed on a per-model basis. Labeling this accordingly as help-wanted.

@dbeatty10 dbeatty10 added help-wanted Extra attention is needed and removed triage:product In Product's queue labels Mar 15, 2023
rexledesma referenced this issue in dagster-io/dagster Mar 22, 2024
## Summary & Motivation
Reverts #19646 and
#19498.

Although we were getting more metadata from dbt, this metadata was:
- Prone to inaccuracy (e.g. see
https://github.com/dbt-labs/dbt-bigquery/issues/602)
- Caused pipelines to potentially double in duration, due to the
abundance of logs

This metadata is easy to get if we introduce a wrapper (and then
metadata retrieval will be square in the ownership of this integration).
With that in mind, we don't need to rely on the metadata from `--debug`.
So remove it.

## How I Tested These Changes
pytest
PedramNavid referenced this issue in dagster-io/dagster Mar 28, 2024
rexledesma referenced this issue in dagster-io/dagster Mar 29, 2024
@dbeatty10
Contributor

These two issues also asked for all bytes processed to be included in the adapter response and run_results.json:

I closed them each in favor of this issue.

@richardbiddle

richardbiddle commented May 17, 2024

This is an earlier issue report which is related to this request:
CT-1107 Log output for materialization incremental incorrectly reports bytes processed

@dbeatty10 dbeatty10 added the feature:cost-reduction Issues related to measuring and reducing execution cost label May 30, 2024
@codigo-ergo-sum

I don't have the opportunity to submit a PR right now for this, but just chiming in as I am the original poster/opener for dbt-labs/dbt-bigquery#775. Understanding the practical challenges of doing so, I am still very interested in seeing this functionality included.

@mikealfare mikealfare added the pkg:dbt-bigquery Issue affects dbt-bigquery label Jan 14, 2025
@mikealfare mikealfare transferred this issue from dbt-labs/dbt-bigquery Jan 14, 2025
@Thrasi

Thrasi commented Jan 30, 2025

I believe this affects packages like dbt Elementary,
which use run_results.json to keep track of model performance.
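For context on the downstream impact: a minimal sketch of how a tool like this might read per-model bytes_processed out of dbt's run_results.json artifact (the helper function and sample data here are illustrative; a model hit by this issue shows 0 even though its temp-table step scanned data).

```python
# Sketch: extract per-model bytes_processed from a parsed run_results.json.
# The nesting (results -> adapter_response -> bytes_processed) follows the
# dbt artifact schema; bytes_by_model is a hypothetical helper, not a
# real library function.
from typing import Dict, Optional

def bytes_by_model(run_results: dict) -> Dict[str, Optional[int]]:
    """Map each node's unique_id to its reported bytes_processed
    (None when the adapter reported nothing)."""
    return {
        result["unique_id"]: result.get("adapter_response", {}).get("bytes_processed")
        for result in run_results["results"]
    }

# In practice this dict comes from json.load(open("target/run_results.json")).
sample = {
    "results": [
        {"unique_id": "model.proj.model_name",
         "adapter_response": {"bytes_processed": 0}},  # copy_partitions: true
    ]
}
print(bytes_by_model(sample))  # {'model.proj.model_name': 0}
```

Any monitoring built on this field inherits the under-reporting described in this issue.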

Projects
None yet
Development

No branches or pull requests

6 participants