Sync attempt with killed pod hangs eternally #48879
Comments
The cron service might be dead though. From the cron I get these logs:
@airbytehq/move-platform can someone take a look into this issue?
I'm facing the same situation. Is there any news related to this issue?
@henriquemeloo, your assessment seems right. In the event where a job pod disappears, whether through a hard failure or the node getting taken away, there's a timeout related to the heartbeat that should catch it. @hvignolo87, same answer, can you confirm if the
Hi @gosusnp! Thanks for your reply! The
@hvignolo87, are you running? Also, do you observe this sync hanging issue frequently? I have seen some activity around the cron recently; there should be a few fixes in the next version.
Yes, I'm running
Yes, this was an everyday issue until I figured out how to define the resources for the non-sync jobs through the environment variables (unfortunately, not documented). After that, although the issue continued to happen, its frequency was significantly reduced. Let's say it's happening 1-2 times per week, depending on Karpenter's actions regarding disruption.
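As an aside (not something discussed in this thread): if Karpenter disruption or spot reclamation is what keeps killing the job pods, the blast radius can also be reduced at the scheduling level. The snippet below is a minimal sketch, assuming your chart forwards the documented JOB_KUBE_ANNOTATIONS and JOB_KUBE_NODE_SELECTORS worker variables to job pods, that your Karpenter version honors the karpenter.sh/do-not-disrupt annotation, and that your nodes carry the karpenter.sh/capacity-type label. It does not fix the hang itself; it only makes pod kills rarer.

# Sketch only: extra worker env vars that mark job pods as non-disruptable
# and keep them off spot capacity. Both variables take comma-separated
# key=value pairs; adjust keys and values to your cluster.
- name: JOB_KUBE_ANNOTATIONS
  value: "karpenter.sh/do-not-disrupt=true"
- name: JOB_KUBE_NODE_SELECTORS
  value: "karpenter.sh/capacity-type=on-demand"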
@hvignolo87, I have been over our recent commit history; I believe the fix to enable the missing component was merged in January and is part of version 1.4.
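For readers who manage Airbyte as a dependency of an umbrella chart, picking up that fix is a matter of bumping the pinned chart version. A minimal sketch, assuming the public airbytehq Helm repository and that 1.4.0 is the concrete tag of the 1.4 release mentioned above:

# Chart.yaml of a hypothetical umbrella chart wrapping the Airbyte platform chart
apiVersion: v2
name: my-airbyte            # hypothetical chart name
version: 0.1.0
dependencies:
  - name: airbyte
    version: 1.4.0          # assumed concrete tag for the "1.4" release referenced above
    repository: https://airbytehq.github.io/helm-charts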
It seems like it's a private repo, because that link doesn't work for me 😓

I've been working to better understand the problem and found a workaround. I'll share it here in case it's useful to someone else in my situation, who doesn't have enough time to update the platform version at the moment.

1. Set the resources for non-sync jobs (if you're using Karpenter like me)

Add these env vars in extraEnv (see the consolidated values.yaml sketch at the end of this comment for where they sit):

- name: SPEC_JOB_MAIN_CONTAINER_CPU_REQUEST
  value: "200m"
- name: SPEC_JOB_MAIN_CONTAINER_MEMORY_REQUEST
  value: "512Mi"
- name: CHECK_JOB_MAIN_CONTAINER_CPU_REQUEST
  value: "200m"
- name: CHECK_JOB_MAIN_CONTAINER_MEMORY_REQUEST
  value: "512Mi"
- name: DISCOVER_JOB_MAIN_CONTAINER_CPU_REQUEST
  value: "200m"
- name: DISCOVER_JOB_MAIN_CONTAINER_MEMORY_REQUEST
  value: "512Mi"

2. Create a ConfigMap to override the default temporal configuration

Searching for open issues related to this one, I found this comment, which was very useful. Create a custom ConfigMap holding the overriding dynamic configuration:

apiVersion: v1
kind: ConfigMap
metadata: # Adjust these as needed
  name: airbyte-platform-temporal-dynamicconfig-override
  namespace: airbyte-prod
data:
  "development.yaml": |
    frontend.namespaceCount:
      - value: 4096
        constraints: {}
    frontend.namespaceRPS.visibility:
      - value: 100
        constraints: {}
    frontend.namespaceBurst.visibility:
      - value: 150
        constraints: {}
    frontend.namespaceRPS:
      - value: 76800
        constraints: {}
    frontend.enableClientVersionCheck:
      - value: true
        constraints: {}
    frontend.persistenceMaxQPS:
      - value: 5000
        constraints: {}
    frontend.throttledLogRPS:
      - value: 200
        constraints: {}
    frontend.enableUpdateWorkflowExecution:
      - value: true
    frontend.enableUpdateWorkflowExecutionAsyncAccepted:
      - value: true
    history.historyMgrNumConns:
      - value: 50
        constraints: {}
    system.advancedVisibilityWritingMode:
      - value: "off"
        constraints: {}
    history.defaultActivityRetryPolicy:
      - value:
          InitialIntervalInSeconds: 1
          MaximumIntervalCoefficient: 100.0
          BackoffCoefficient: 2.0
          MaximumAttempts: 0
    history.defaultWorkflowRetryPolicy:
      - value:
          InitialIntervalInSeconds: 1
          MaximumIntervalCoefficient: 100.0
          BackoffCoefficient: 2.0
          MaximumAttempts: 0
    # Limit for responses. This mostly impacts discovery jobs since they have the largest responses.
    limit.blobSize.error:
      - value: 15728640 # 15MB
        constraints: {}
    limit.blobSize.warn:
      - value: 10485760 # 10MB
        constraints: {}

3. Override the config in the temporal pod

Override the temporal section of the Helm values:

temporal:
  # -- Additional env vars for temporal pod(s).
  extraEnv:
    - name: DYNAMIC_CONFIG_FILE_PATH
      value: config/dynamicconfig-override/development.yaml # The path to the override/patch
  # -- Additional volumeMounts for temporal containers
  extraVolumeMounts:
    - name: airbyte-temporal-dynamicconfig-override
      mountPath: "/etc/temporal/config/dynamicconfig-override/"
  # -- Additional volumes for temporal pods
  extraVolumes:
    - name: airbyte-temporal-dynamicconfig-override
      configMap:
        name: airbyte-platform-temporal-dynamicconfig-override # The ConfigMap created in step 2
        items:
          - key: development.yaml
            path: development.yaml

After creating these resources and deploying the changes, it's working like a charm 🔧
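For anyone wiring this up from scratch: the steps above never say which top-level section of the Helm values the step 1 variables belong to. Below is a minimal consolidated sketch of values.yaml, assuming a chart layout where the *_JOB_MAIN_CONTAINER_* variables are read from the worker section; the exact section name may differ between chart versions, so treat it as an assumption rather than the definitive location.

# values.yaml sketch; the "worker" section name is an assumption, check your chart version
worker:
  extraEnv:
    # Step 1: resource requests for the non-sync (spec/check/discover) job pods
    - name: SPEC_JOB_MAIN_CONTAINER_CPU_REQUEST
      value: "200m"
    - name: SPEC_JOB_MAIN_CONTAINER_MEMORY_REQUEST
      value: "512Mi"
    - name: CHECK_JOB_MAIN_CONTAINER_CPU_REQUEST
      value: "200m"
    - name: CHECK_JOB_MAIN_CONTAINER_MEMORY_REQUEST
      value: "512Mi"
    - name: DISCOVER_JOB_MAIN_CONTAINER_CPU_REQUEST
      value: "200m"
    - name: DISCOVER_JOB_MAIN_CONTAINER_MEMORY_REQUEST
      value: "512Mi"
# temporal:
#   ...  # step 3 overrides go here, exactly as shown above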
The correct link is airbytehq/airbyte-platform@6b61cf1.
I'm not sure I understand why that change fixes this problem 🤔
The component in the

Heads up, I don't think the configuration overrides you uncovered are meant to stay in the
Helm Chart Version
1.2.0
What step the error happened?
During the Sync
Relevant information
When the pod for a sync gets killed externally (e.g. by cluster downsizing or by running nodes on EC2 spot instances), the sync just hangs there without starting a new attempt or failing. I would expect the heartbeat mechanism (which I did not intentionally configure either on or off) to flag the attempt as failed, but the sync's logs just freeze and the sync is still tagged as running in the webapp. I couldn't find any logs in the other pods related to this specific job ID. What I did find were these (possibly unrelated) logs from the server deployment:
Relevant log output