[AIRFLOW-249] Refactor the SLA mechanism (Continuation from #3584 ) #8545
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst)
Thank you for taking this on. I couldn't justify any more time on it but I think it's still very relevant to the project.
Happy to contribute :)
A very big change, lots to take in. Made a first round of comments.
Addressed most, if not all comments in the previous round of review. Played a bit with two cases regarding how we fetch DagRuns for SLA consideration:
My conclusion is that option 1 is the better trade-off, because one has to go through all TIs in a DagRun to determine whether a DR can be exempted from further checking (e.g., if a DR has 10 TIs, then each TI has to be checked for all possible SLA violations before the DR is marked sla_checked). This is not a cheap operation, since a single TI could have 3 SLAs; the additional computation and IO could easily outweigh the benefit of filtering out sla_checked DRs.
Option 1 doesn't guarantee correctness, right? I.e., if there are more DagRuns that need to be checked than the preset limit, some of them will be ignored? With regard to the performance comparison between option 1 and option 2, aren't we already checking all the TIs for the 100 fetched DagRuns in option 1?
True. I guess the way to do it (if no additional column is added) would be to remove the fixed count, and simply do
This way we only get DR that have yet to succeed (since we made an assumption that successful DRs are free from SLA check).
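To make the trade-off above concrete, here is a minimal sketch in plain Python, not the code from this PR. The `DagRun` dataclass, its `state` values, and both helper functions are hypothetical stand-ins used only for illustration; the real change would express this as a database query.

```python
from dataclasses import dataclass


@dataclass
class DagRun:
    # Hypothetical stand-in for a DagRun row, not the Airflow model.
    run_id: str
    state: str  # assumed values: "running", "success", "failed"


def fetch_with_limit(dag_runs, limit):
    """Fixed-count fetch: anything past `limit` is never checked."""
    return dag_runs[:limit]


def fetch_unfinished(dag_runs):
    """No fixed count: keep every run that has not succeeded yet."""
    return [dr for dr in dag_runs if dr.state != "success"]


# Six runs, alternating between succeeded and still running.
runs = [
    DagRun(f"run_{i}", "success" if i % 2 == 0 else "running")
    for i in range(6)
]

limited = fetch_with_limit(runs, limit=2)
unfinished = fetch_unfinished(runs)

# Runs that still need SLA checks but fall outside the fixed limit.
missed = [dr.run_id for dr in unfinished if dr not in limited]
```

With a limit of 2, `missed` contains the unfinished runs that the fixed-count fetch never looks at, which is the correctness concern raised above.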
We are checking whether these TIs are violating SLAs, not whether these TIs are free from SLAs; those are different checks (e.g., to check if a TI violates expected_duration, we compare the current duration with the SLA; to check if a TI is free from SLA violations, we assert that the TI has finished within the expected_duration). Doing the latter would require us to add another column to TI as well. I'm slightly inclined towards option 1 (probably with the fixed limit of 100 removed), but definitely open to other opinions. :)
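The difference between the two checks can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: the `TaskInstance` dataclass and both function names are hypothetical, and only `expected_duration` is taken from the discussion.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class TaskInstance:
    # Hypothetical stand-in for a task instance, not the Airflow model.
    start_date: datetime
    end_date: Optional[datetime]  # None while the task is still running
    expected_duration: timedelta


def violates_expected_duration(ti: TaskInstance, now: datetime) -> bool:
    """True if the TI has already run longer than its SLA allows."""
    end = ti.end_date if ti.end_date is not None else now
    return end - ti.start_date > ti.expected_duration


def free_from_expected_duration(ti: TaskInstance) -> bool:
    """True only if the TI has finished, and finished within the SLA."""
    if ti.end_date is None:
        return False
    return ti.end_date - ti.start_date <= ti.expected_duration


start = datetime(2020, 1, 1, 12, 0)
running = TaskInstance(start, None, timedelta(minutes=30))
now = start + timedelta(minutes=10)

# A running TI that is still inside its SLA is neither in violation
# nor free from future violations, so it has to be checked again.
is_violating = violates_expected_duration(running, now)
is_free = free_from_expected_duration(running)
```

Because a TI can be in this "neither" state, the absence of a violation cannot be used to mark a DagRun as done with SLA checking.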
If we can add |
Hi @seanxwzhang, any updates on this patch?
Unfortunately, I won't be able to continue working on this patch; happy to hand it over to others.
To take it over, where should someone focus? Fixing conflicts first?
I guess with how far behind this change is and how many conflicts it has, starting from scratch and following the ideas here is a far better idea.
@auvipy are you currently working on this PR? If not, I'm happy to take the ideas and open up a new one that works out of the DagFileProcessor/DagFileProcessingManager.
Please go ahead. Feel free to ping me for review.
Hi everyone, I've spent a lot of time collecting all reported concerns that the community has had regarding SLAs to date. After much deliberation, I've reached the conclusion that we might be better off defining the Airflow-native SLA feature only at the DAG level, where it can be supported to users' expectations, and leaving the task-level SLA definition to the users. There are three main reasons why I think task-level SLAs should be implemented by the users instead of by Airflow.
In contrast, I believe DAG-level SLA will strictly be a positive feature. It will increase the general reliability of Airflow DAGs and even be able to alert us to job delays when undefined behaviors happen, all without negatively impacting the performance of the scheduler. If you have been interested in the SLA mechanism, or have been actively using the current version of the SLA mechanism, I would love to get your feedback on this proposal. I would love to work with you to try to come up with an SLA solution that meets user expectations!
This PR tries to land #3584. Most changes are from @Eronarn; I rebased on master and added a few tests.
As per @Eronarn :
Make sure to mark the boxes below before creating PR: [x]
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.
Read the Pull Request Guidelines for more information.