Replies: 6 comments 5 replies
-
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.
-
Converted it into a discussion, because this is not really something actionable and is likely a deployment issue. I think you should take a look at your deployment logs - specifically, look in detail at the EKS logs and whether there are any events correlated with your SIGTERMs. IMHO this is a result of some of the machines in your cluster being restarted - but what the reason for that is, hard to say. In order for someone to help you, you should take a deep look at your deployment logs and, rather than showing the details of the Airflow configuration, show what happens with your deployment there. It would also be useful (for anyone who might look at it - not necessarily me, I might not have time to look in detail) to see the exact logs of the SIGTERM being received; your deployment/EKS logs, when you look deeply there, should also show at least what is sending those TERM signals. You should also look at things like your health check logs and other components in K8s that might get restarted.
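A minimal sketch of one way to do that correlation (not from the thread; it assumes the chart is deployed to an `airflow` namespace and that you have a kubeconfig with access to it) - dump the recent Kubernetes events for the namespace and line their timestamps up against the SIGTERM times:

```python
# Hedged sketch: list recent Kubernetes events in the Airflow namespace so their
# timestamps can be compared with the SIGTERMs. The namespace name "airflow" is
# an assumption -- replace it with the namespace the chart is deployed to.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside a pod
v1 = client.CoreV1Api()

events = v1.list_namespaced_event(namespace="airflow")
for ev in events.items:
    obj = ev.involved_object
    print(ev.last_timestamp, f"{obj.kind}/{obj.name}", ev.reason, "-", ev.message)
```

Eviction/Killing/node-pressure events (or autoscaler activity on the nodes) that line up with the SIGTERM timestamps usually point at what is terminating the pods.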
-
What executor are you using? In the config and in the Helm values it is Kubernetes, but the Helm values also define workers with replicas: 1 - is that worker running but not being used?
-
Hi @potiuk, we are also experiencing this issue. I tried to debug and found that whenever a SIGTERM is received, the scheduler logs contain this info message at the exact same time:

I couldn't find any other instances of this info log, but if this is indeed where it is getting logged from, I am not able to figure out how it gets into the condition where the pod status is marked as

Here are the detailed logs

Scheduler logs when SIGTERM received:

Corresponding task logs:

Deployment Info:
-
@mschueler, @msardana94 I am curious whether you were able to solve the issue. We are facing the same issue after migrating from spot.io to Karpenter. After a day of troubleshooting I have a strong feeling that it is happening because of the Karpenter issue "Taint nodes with a NoSchedule for Consolidation before Validation begins" (#651). I don't want to mislead you, but it is an idea to check; I am still investigating on my end :)
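If Karpenter consolidation does turn out to be the culprit, one way to test that hypothesis is to mark a few KubernetesExecutor worker pods as non-evictable for a while and see whether their SIGTERMs stop. Below is a hedged sketch using `executor_config`/`pod_override`; the DAG and task names are illustrative only, and the annotation name depends on the Karpenter version (older releases use `karpenter.sh/do-not-evict`, newer ones use `karpenter.sh/do-not-disrupt`):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from kubernetes.client import models as k8s

with DAG(
    dag_id="karpenter_eviction_probe",  # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
):
    BashOperator(
        task_id="long_running_task",
        bash_command="sleep 600",
        # pod_override is honored by the KubernetesExecutor; the annotation asks
        # (older) Karpenter not to evict this pod during consolidation. Swap in
        # karpenter.sh/do-not-disrupt for newer Karpenter versions.
        executor_config={
            "pod_override": k8s.V1Pod(
                metadata=k8s.V1ObjectMeta(
                    annotations={"karpenter.sh/do-not-evict": "true"}
                )
            )
        },
    )
```

If annotated tasks stop receiving SIGTERMs while unannotated ones keep getting them, that is fairly strong evidence that node consolidation is what is killing the workers.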
-
@mschueler did you ever figure out what was causing this? We are also facing this exact same issue after migrating to the official Airflow Helm chart from the Airflow community chart and have no idea what's causing it. We've debugged and ruled out almost all potential Kubernetes-related issues.
-
Apache Airflow version
Other Airflow 2 version (please specify below)
What happened
We are seeing intermittent SIGTERMs on DAGs. There seems to be no rhyme or reason to the SIGTERMs (e.g. it seems to happen to all our DAGs at some point or another, and there is no pattern to the timing).
The deployment is through the Helm chart to an EKS cluster. It's happening in both our nonprod and prod clusters. We've tried different things in our nonprod environment to fix it, basically following ideas we found from Google searches: increasing resources, upgrading the Airflow version, checking logs (we've found nothing useful in the logs but will post as much info as I can here), increasing timeouts, and trying some settings we found mentioned in other GitHub issues (shown below).
```yaml
- name: AIRFLOW__CORE__KILLED_TASK_CLEANUP_TIME
  value: "3600"
  value: "false"
```
Nonprod: EKS 1.26 / Airflow 2.5.1.
Prod: EKS 1.25 / Airflow 2.2.4
We're focusing our efforts on nonprod, but just wanted to mention we're seeing the issue on multiple versions. Also, I believe the original version we started on was 2.0.x, but we've been struggling with this issue since January (when we first started to set up Airflow 2.0 on k8s). As a workaround we are doing a retry where possible.
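For reference, a minimal sketch of that retry workaround, assuming it is done with plain Airflow retries (the DAG and task names are illustrative only):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_with_retries",  # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    # Retry tasks killed by the intermittent SIGTERMs instead of failing the run outright.
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
):
    BashOperator(task_id="flaky_task", bash_command="echo hello")
```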
This is the exact error:
airflow.exceptions.AirflowException: Task received SIGTERM signal
Would truly appreciate any help or insight into what we're doing wrong. I've tried to put as much information below as possible but if I'm missing something, please let me know.
Helm values file:
Resulting airflow.cfg (ConfigMap):
What you think should happen instead
No response
How to reproduce
Intermittent. Schedule a DAG run.
Operating System
Kubernetes -- DAGs running on image based on Debian Bullseye
Versions of Apache Airflow Providers
| Package | Version | Description |
| --- | --- | --- |
| apache-airflow-providers-grpc | 3.1.0 | gRPC |
| apache-airflow-providers-hashicorp | 3.3.0 | Hashicorp including Hashicorp Vault |
| apache-airflow-providers-http | 4.2.0 | Hypertext Transfer Protocol (HTTP) |
| apache-airflow-providers-imap | 3.1.1 | Internet Message Access Protocol (IMAP) |
| apache-airflow-providers-microsoft-azure | 5.2.1 | Microsoft Azure |
| apache-airflow-providers-microsoft-mssql | 2.1.3 | Microsoft SQL Server (MSSQL) |
| apache-airflow-providers-mysql | 2.2.3 | MySQL |
| apache-airflow-providers-odbc | 3.2.1 | ODBC |
| apache-airflow-providers-oracle | 2.2.3 | Oracle |
| apache-airflow-providers-postgres | 5.4.0 | PostgreSQL |
| apache-airflow-providers-redis | 3.1.0 | Redis |
| apache-airflow-providers-sendgrid | 3.1.0 | Sendgrid |
| apache-airflow-providers-sftp | 2.6.0 | SSH File Transfer Protocol (SFTP) |
| apache-airflow-providers-slack | 7.2.0 | Slack |
| apache-airflow-providers-snowflake | 2.1.1 | Snowflake |
| apache-airflow-providers-sqlite | 3.3.1 | SQLite |
| apache-airflow-providers-ssh | 3.5.0 | Secure Shell (SSH) |
| apache-airflow-providers-tableau | 2.1.8 | Tableau |
Deployment
Official Apache Airflow Helm Chart
Deployment details
EKS 1.25 running Karpenter (cluster autoscaler replacement)
Anything else
Intermittent -- 5-50x a day
Are you willing to submit PR?
Code of Conduct