[Bug/Enhancement] Microbatch models shouldn't block the main thread in multi-threaded dbt runs. #11243

Open
1 task done
QMalcolm opened this issue Jan 27, 2025 · 0 comments
Labels
backport 1.9.latest, bug (Something isn't working), microbatch (Issues related to the microbatch incremental strategy)

Housekeeping

  • I am a maintainer of dbt-core

Short description

Microbatch models currently block the main thread when dbt runs with multiple threads. This affects microbatch models whether their batches run concurrently or sequentially, but the impact is greater when the batches run sequentially. In essence, the issue is that the scheduling of batch execution for a microbatch model happens on the main thread, so the scheduling of all other models is blocked until the microbatch model completes.

For example, suppose our project has a microbatch model with many sibling/cousin/nibling/etc. nodes, and dbt is configured to run with multiple threads. Once the main thread reaches the microbatch model, work already in progress on worker threads for other nodes continues. However, no new work for other nodes gets scheduled, because that scheduling is handled by the main thread, which is now blocked by the microbatch model. When the batches are run sequentially, all of them execute on the main thread, effectively reducing the dbt invocation to a single thread until the microbatch model completes. When the batches are run concurrently, worker threads are spun up for the batches, but the main thread remains blocked until all of them have completed. In that case the worker threads stay mostly saturated; however, if one long-running batch remains after all the other batches have finished, the other worker threads sit idle until that last batch completes. A minimal sketch of this blocking behavior is shown below.
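The sketch below is illustrative only and does not reflect dbt-core's actual scheduler code; the names (Node, run_batch, run_node, schedule_nodes) are hypothetical. It shows why running (or waiting on) a microbatch model's batches inline in the main-thread scheduling loop prevents any further nodes from being handed to the worker pool.

```python
# Hypothetical sketch of the current blocking behavior; not dbt-core code.
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    is_microbatch: bool = False
    batches: list = field(default_factory=list)

def run_batch(node: Node, batch) -> None:
    time.sleep(0.1)  # stand-in for executing one batch of a microbatch model

def run_node(node: Node) -> None:
    time.sleep(0.1)  # stand-in for executing a regular model

def schedule_nodes(nodes: list[Node], pool: ThreadPoolExecutor) -> None:
    # Runs on the main thread. When it reaches a microbatch node it executes
    # (or, in the concurrent case, waits on) every batch inline, so no further
    # nodes are submitted to the pool until the whole microbatch model finishes.
    for node in nodes:
        if node.is_microbatch:
            for batch in node.batches:
                run_batch(node, batch)
        else:
            pool.submit(run_node, node)

with ThreadPoolExecutor(max_workers=4) as pool:
    schedule_nodes(
        [Node("a"), Node("mb", is_microbatch=True, batches=list(range(5))), Node("b")],
        pool,
    )
```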

We should instead delegate the scheduling of a microbatch model's batches to a worker thread. When batches are run sequentially, only one worker thread would be occupied and the main thread would not be blocked. When batches are run concurrently, one worker thread would do the scheduling and saturate the other worker threads with batches. This does mean one worker thread won't be doing batch work, which isn't great. However, if a few long-running batches are holding only a subset of the worker threads, the main thread can continue to allocate other nodes to the freed-up worker threads. A sketch of this approach follows.
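The sketch below reuses the hypothetical Node, run_batch, and run_node helpers from the previous sketch and is, again, not dbt-core's actual implementation. The per-model batch loop is itself submitted to the pool, so the main thread's scheduling loop never blocks on a microbatch model. In the concurrent case this assumes the pool has at least two workers, since one worker acts as the coordinator while the rest run batches.

```python
# Hypothetical sketch of the proposed delegation; not dbt-core code.
from concurrent.futures import ThreadPoolExecutor, wait

def run_batches(node, pool: ThreadPoolExecutor, concurrent: bool) -> None:
    # Runs on a worker thread rather than the main thread.
    if concurrent:
        # Fan the batches out to the remaining worker threads and wait here;
        # this is the one thread that "won't be doing batch work".
        wait([pool.submit(run_batch, node, b) for b in node.batches])
    else:
        # Sequential batches occupy only this single worker thread.
        for batch in node.batches:
            run_batch(node, batch)

def schedule_nodes(nodes, pool: ThreadPoolExecutor, concurrent_batches: bool = False) -> None:
    # The main thread only submits work; it never runs or waits on batches,
    # so it can keep allocating other nodes to freed-up worker threads.
    for node in nodes:
        if node.is_microbatch:
            pool.submit(run_batches, node, pool, concurrent_batches)
        else:
            pool.submit(run_node, node)

with ThreadPoolExecutor(max_workers=4) as pool:
    schedule_nodes(
        [Node("a"), Node("mb", is_microbatch=True, batches=list(range(5))), Node("b")],
        pool,
        concurrent_batches=True,
    )
```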

Acceptance criteria

  • microbatch batch execution doesn't block the main thread in multi-threaded dbt environments

Suggested Tests

  • a test demonstrating that microbatch batch execution doesn't block the main thread in multi-threaded dbt environments

Impact to Other Teams

N/A

Will backports be required?

Possibly 1.9, although this might be a large enough change that it isn't safe to do so.

Context

No response

@QMalcolm QMalcolm added bug Something isn't working enhancement New feature or request microbatch Issues related to the microbatch incremental strategy labels Jan 27, 2025
@graciegoheen graciegoheen added backport 1.9.latest and removed enhancement New feature or request labels Jan 30, 2025