Replies: 6 comments 5 replies
-
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.
-
Converted it into a discussion, because this is not really something actionable and is likely a deployment issue. I think you should take a look at your deployment logs - specifically, look in detail at the EKS logs and whether there are any events correlated with your SIGTERMs. IMHO this is a result of some of the machines in your cluster being restarted - but what the reason for that is, hard to say. In order for someone to help you, you should take a deep look at your deployment logs and, rather than showing the details of the Airflow configuration, show what happens with your deployment there. It would also be useful (for anyone who might look at it - not necessarily me, I might not have time to look in detail) to see the exact logs of the SIGTERM being received; your deployment/EKS logs, when you look deeply there, should also show at least what is sending those TERM signals. You should also look at things like your health check logs and other components in K8s that might get restarted.
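A minimal sketch of one way to do that correlation (not from the thread; it assumes the chart is deployed to an `airflow` namespace and that you have a kubeconfig with access to it) - dump the recent Kubernetes events for the namespace and line their timestamps up against the SIGTERM times:

```python
# Hedged sketch: list recent Kubernetes events in the Airflow namespace so their
# timestamps can be compared with the SIGTERMs. The namespace name "airflow" is
# an assumption -- replace it with the namespace the chart is deployed to.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside a pod
v1 = client.CoreV1Api()

events = v1.list_namespaced_event(namespace="airflow")
for ev in events.items:
    obj = ev.involved_object
    print(ev.last_timestamp, f"{obj.kind}/{obj.name}", ev.reason, "-", ev.message)
```

Eviction/Killing/node-pressure events (or autoscaler activity on the nodes) that line up with the SIGTERM timestamps usually point at what is terminating the pods.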
-
What executor are you using? In the config and in the Helm values it is Kubernetes, but the Helm values also define workers with replicas: 1 - is that worker running but not being used?
-
Hi @potiuk, we are also experiencing this issue. I tried to debug and found that whenever a SIGTERM is received, the scheduler logs contain this info message at the exact same time:

I couldn't find any other instances of this info log, but if this is indeed where it is getting logged from, I am not able to figure out how it gets into the condition where the pod status is marked as

Here are the detailed logs

Scheduler logs when SIGTERM received:

Corresponding task logs:

Deployment Info:
-
@mschueler, @msardana94 I am curious whether you were able to solve the issue. We are facing the same issue after migrating from spot.io to Karpenter. After a day of troubleshooting I have a strong feeling that it is happening because of the Karpenter issue "Taint nodes with a NoSchedule for Consolidation before Validation begins" (#651). I don't want to mislead you, but it is an idea to check; I am still investigating on my end :)
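If Karpenter consolidation does turn out to be the culprit, one way to test that hypothesis is to mark a few KubernetesExecutor worker pods as non-evictable for a while and see whether their SIGTERMs stop. Below is a hedged sketch using `executor_config`/`pod_override`; the DAG and task names are illustrative only, and the annotation name depends on the Karpenter version (older releases use `karpenter.sh/do-not-evict`, newer ones use `karpenter.sh/do-not-disrupt`):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from kubernetes.client import models as k8s

with DAG(
    dag_id="karpenter_eviction_probe",  # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
):
    BashOperator(
        task_id="long_running_task",
        bash_command="sleep 600",
        # pod_override is honored by the KubernetesExecutor; the annotation asks
        # (older) Karpenter not to evict this pod during consolidation. Swap in
        # karpenter.sh/do-not-disrupt for newer Karpenter versions.
        executor_config={
            "pod_override": k8s.V1Pod(
                metadata=k8s.V1ObjectMeta(
                    annotations={"karpenter.sh/do-not-evict": "true"}
                )
            )
        },
    )
```

If annotated tasks stop receiving SIGTERMs while unannotated ones keep getting them, that is fairly strong evidence that node consolidation is what is killing the workers.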
-
@mschueler did you ever figure out what was causing this? We are also facing this exact same issue after migrating to the official Airflow Helm chart from the Airflow community chart and have no idea what's causing it. We've debugged and ruled out almost all potential Kubernetes-related issues.
-
Apache Airflow version
Other Airflow 2 version (please specify below)
What happened
We are seeing intermittent SIGTERMs on DAGs. There seems to be no rhyme or reason to the SIGTERMs (e.g. it seems to happen to all our DAGs at some point or another, and there is no pattern to the timing).
The deployment is through the Helm chart to an EKS cluster. It's happening in both our nonprod and prod clusters. We've tried different things in our nonprod environment to fix it, basically following ideas we found from Google searches: increasing resources, upgrading the Airflow version, checking logs (we've found nothing useful in the logs but will post as much info as I can here), increasing timeouts, and trying some settings we found mentioned in other GitHub issues (shown below).
```yaml
- name: AIRFLOW__CORE__KILLED_TASK_CLEANUP_TIME
  value: "3600"
  value: "false"
```
Nonprod: EKS 1.26 / Airflow 2.5.1.
Prod: EKS 1.25 / Airflow 2.2.4
We're focusing our efforts on nonprod, but just wanted to mention we're seeing the issue on multiple versions. Also, I believe the original version we started on was 2.0.x, but we've been struggling with this issue since January (when we first started to set up Airflow 2.0 on k8s). As a workaround we are doing a retry where possible.
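For reference, a minimal sketch of that retry workaround, assuming it is done with plain Airflow retries (the DAG and task names are illustrative only):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_with_retries",  # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    # Retry tasks killed by the intermittent SIGTERMs instead of failing the run outright.
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
):
    BashOperator(task_id="flaky_task", bash_command="echo hello")
```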
This is the exact error:
airflow.exceptions.AirflowException: Task received SIGTERM signal
Would truly appreciate any help or insight into what we're doing wrong. I've tried to put as much information below as possible but if I'm missing something, please let me know.
Helm values file:
Resulting airflow.cfg (ConfigMap):
What you think should happen instead
No response
How to reproduce
Intermittent. Schedule a DAG run.
Operating System
Kubernetes -- DAGs running on image based on Debian Bullseye
Versions of Apache Airflow Providers
| Package | Version | Description |
| --- | --- | --- |
| apache-airflow-providers-grpc | 3.1.0 | gRPC |
| apache-airflow-providers-hashicorp | 3.3.0 | Hashicorp including Hashicorp Vault |
| apache-airflow-providers-http | 4.2.0 | Hypertext Transfer Protocol (HTTP) |
| apache-airflow-providers-imap | 3.1.1 | Internet Message Access Protocol (IMAP) |
| apache-airflow-providers-microsoft-azure | 5.2.1 | Microsoft Azure |
| apache-airflow-providers-microsoft-mssql | 2.1.3 | Microsoft SQL Server (MSSQL) |
| apache-airflow-providers-mysql | 2.2.3 | MySQL |
| apache-airflow-providers-odbc | 3.2.1 | ODBC |
| apache-airflow-providers-oracle | 2.2.3 | Oracle |
| apache-airflow-providers-postgres | 5.4.0 | PostgreSQL |
| apache-airflow-providers-redis | 3.1.0 | Redis |
| apache-airflow-providers-sendgrid | 3.1.0 | Sendgrid |
| apache-airflow-providers-sftp | 2.6.0 | SSH File Transfer Protocol (SFTP) |
| apache-airflow-providers-slack | 7.2.0 | Slack |
| apache-airflow-providers-snowflake | 2.1.1 | Snowflake |
| apache-airflow-providers-sqlite | 3.3.1 | SQLite |
| apache-airflow-providers-ssh | 3.5.0 | Secure Shell (SSH) |
| apache-airflow-providers-tableau | 2.1.8 | Tableau |
Deployment
Official Apache Airflow Helm Chart
Deployment details
EKS 1.25 running Karpenter (cluster autoscaler replacement)
Anything else
Intermittent -- 5-50x a day
Are you willing to submit PR?
Code of Conduct