Sync attempt with killed pod hangs eternally #48879

Open
henriquemeloo opened this issue Dec 10, 2024 · 12 comments
Labels: area/platform (issues related to the platform), community, team/platform-move, type/bug (Something isn't working)

Comments

@henriquemeloo
Contributor

Helm Chart Version

1.2.0

What step the error happened?

During the Sync

Relevant information

When the pod for a sync gets killed externally (e.g. by cluster downsizing or by running nodes on EC2 spot instances), the sync just hangs without starting a new attempt or failing. I would expect the heartbeat mechanism (which I did not intentionally configure either on or off) to flag this attempt as failed, but the sync's logs just freeze and the sync is still shown as running in the webapp. I couldn't find any logs related to this specific job ID in the other pods. What I did find were these (possibly unrelated) logs from the server deployment:

Relevant log output

2024-12-10 13:53:15,461 [io-executor-thread-2]	WARN	i.a.c.s.c.JobConverter(getWorkspaceId):403 - Unable to retrieve workspace ID for job null.
java.lang.NullPointerException: null
    at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:904)
    at com.google.common.cache.LocalCache.get(LocalCache.java:4016)
    at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4040)
    at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4989)
    at io.airbyte.persistence.job.WorkspaceHelper.lambda$getWorkspaceForJobId$4(WorkspaceHelper.java:162)
    at io.airbyte.persistence.job.WorkspaceHelper.handleCacheExceptions(WorkspaceHelper.java:231)
    at io.airbyte.persistence.job.WorkspaceHelper.getWorkspaceForJobId(WorkspaceHelper.java:162)
    at io.airbyte.commons.server.converters.JobConverter.getWorkspaceId(JobConverter.java:401)
    at io.airbyte.commons.server.converters.JobConverter.getAttemptLogs(JobConverter.java:320)
    at io.airbyte.commons.server.converters.JobConverter.getSynchronousJobRead(JobConverter.java:349)
    at io.airbyte.commons.server.converters.JobConverter.getSynchronousJobRead(JobConverter.java:344)
    at io.airbyte.commons.server.handlers.SchedulerHandler.retrieveDiscoveredSchema(SchedulerHandler.java:567)
    at io.airbyte.commons.server.handlers.SchedulerHandler.discoverAndGloballyDisable(SchedulerHandler.java:400)
    at io.airbyte.commons.server.handlers.SchedulerHandler.discoverSchemaForSourceFromSourceId(SchedulerHandler.java:365)
    at io.airbyte.commons.server.handlers.WebBackendConnectionsHandler.getRefreshedSchema(WebBackendConnectionsHandler.java:488)
    at io.airbyte.commons.server.handlers.WebBackendConnectionsHandler.webBackendGetConnection(WebBackendConnectionsHandler.java:432)
    at io.airbyte.server.apis.WebBackendApiController.lambda$webBackendGetConnection$2(WebBackendApiController.java:113)
    at io.airbyte.server.apis.ApiHelper.execute(ApiHelper.kt:29)
    at io.airbyte.server.apis.WebBackendApiController.webBackendGetConnection(WebBackendApiController.java:103)
    at io.airbyte.server.apis.$WebBackendApiController$Definition$Exec.dispatch(Unknown Source)
    at io.micronaut.context.AbstractExecutableMethodsDefinition$DispatchedExecutableMethod.invokeUnsafe(AbstractExecutableMethodsDefinition.java:461)
    at io.micronaut.context.DefaultBeanContext$BeanContextUnsafeExecutionHandle.invokeUnsafe(DefaultBeanContext.java:4350)
    at io.micronaut.web.router.AbstractRouteMatch.execute(AbstractRouteMatch.java:272)
    at io.micronaut.web.router.DefaultUriRouteMatch.execute(DefaultUriRouteMatch.java:38)
    at io.micronaut.http.server.RouteExecutor.executeRouteAndConvertBody(RouteExecutor.java:498)
    at io.micronaut.http.server.RouteExecutor.lambda$callRoute$5(RouteExecutor.java:475)
    at io.micronaut.core.execution.ExecutionFlow.lambda$async$1(ExecutionFlow.java:87)
    at io.micronaut.core.propagation.PropagatedContext.lambda$wrap$3(PropagatedContext.java:211)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1583)
@henriquemeloo
Contributor Author

henriquemeloo commented Dec 10, 2024

The cron service might be dead though. From the cron I get these logs:

2024-12-10 13:31:52,128 [scheduled-executor-thread-4]	ERROR	i.m.s.DefaultTaskExceptionHandler(handle):47 - Error invoking scheduled task for bean [io.airbyte.cron.jobs.SelfHealTemporalWorkflows@757950e4] RESOURCE_EXHAUSTED: namespace rate limit exceeded
io.grpc.StatusRuntimeException: RESOURCE_EXHAUSTED: namespace rate limit exceeded
    at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:268)
    at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:249)
    at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:167)
    at io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.listClosedWorkflowExecutions(WorkflowServiceGrpc.java:4620)
    at io.airbyte.commons.temporal.WorkflowServiceStubsWrapped.lambda$blockingStubListClosedWorkflowExecutions$0(WorkflowServiceStubsWrapped.java:47)
    at dev.failsafe.Functions.lambda$toCtxSupplier$11(Functions.java:243)
    at dev.failsafe.Functions.lambda$get$0(Functions.java:46)
    at dev.failsafe.internal.RetryPolicyExecutor.lambda$apply$0(RetryPolicyExecutor.java:74)
    at dev.failsafe.SyncExecutionImpl.executeSync(SyncExecutionImpl.java:187)
    at dev.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:376)
    at dev.failsafe.FailsafeExecutor.get(FailsafeExecutor.java:112)
    at io.airbyte.commons.temporal.RetryHelper.withRetries(RetryHelper.java:60)
    at io.airbyte.commons.temporal.WorkflowServiceStubsWrapped.withRetries(WorkflowServiceStubsWrapped.java:67)
    at io.airbyte.commons.temporal.WorkflowServiceStubsWrapped.blockingStubListClosedWorkflowExecutions(WorkflowServiceStubsWrapped.java:47)
    at io.airbyte.commons.temporal.TemporalClient.fetchClosedWorkflowsByStatus(TemporalClient.java:139)
    at io.airbyte.commons.temporal.TemporalClient.restartClosedWorkflowByStatus(TemporalClient.java:117)
    at io.airbyte.cron.jobs.SelfHealTemporalWorkflows.cleanTemporal(SelfHealTemporalWorkflows.java:42)
    at io.airbyte.cron.jobs.$SelfHealTemporalWorkflows$Definition$Exec.dispatch(Unknown Source)
    at io.micronaut.context.AbstractExecutableMethodsDefinition$DispatchedExecutableMethod.invoke(AbstractExecutableMethodsDefinition.java:456)
    at io.micronaut.inject.DelegatingExecutableMethod.invoke(DelegatingExecutableMethod.java:86)
    at io.micronaut.context.bind.DefaultExecutableBeanContextBinder$ContextBoundExecutable.invoke(DefaultExecutableBeanContextBinder.java:152)
    at io.micronaut.scheduling.processor.ScheduledMethodProcessor.lambda$scheduleTask$2(ScheduledMethodProcessor.java:160)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
    at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:358)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1583)

@marcosmarxm
Member

@airbytehq/move-platform can someone take a look at this issue?

@hvignolo87

I'm facing the same situation. Is there any news related to this issue?

@gosusnp
Contributor

gosusnp commented Mar 17, 2025

@henriquemeloo, your assessment seems right. In the event that a job pod disappears (hard failure or the node getting taken away), there is a timeout related to the heartbeat that should catch it.
The cron is the application enforcing this. One way to confirm this from the logs would be to look for "Checking for non heartbeating workload".

@hvignolo87 , same answer, can you confirm if the cron is running?
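
For reference, a minimal way to check for that message from outside the pod, assuming the cron runs as a Deployment named airbyte-cron in the namespace where Airbyte is installed (adjust names to your release):

kubectl -n airbyte logs deploy/airbyte-cron | grep "Checking for non heartbeating workload"

If that line never shows up in the cron logs, the monitor that fails non-heartbeating workloads is likely not running.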

@hvignolo87

> @henriquemeloo, your assessment seems right. In the event that a job pod disappears (hard failure or the node getting taken away), there is a timeout related to the heartbeat that should catch it. The cron is the application enforcing this. One way to confirm this from the logs would be to look for "Checking for non heartbeating workload".
>
> @hvignolo87 , same answer, can you confirm if the cron is running?

Hi @gosusnp! Thanks for your reply!

The cron is up and running, and there's no match for "Checking for non heartbeating workload" in its logs.

@gosusnp
Contributor

gosusnp commented Mar 19, 2025

@hvignolo87, are you running 1.2.0 as well? If not, which version are you running?

Also, do you observe this sync hanging issue frequently? I have seen some activity around the cron recently; there should be a few fixes in the next version.

@hvignolo87

> @hvignolo87, are you running 1.2.0 as well? If not, which version are you running?

Yes, I'm running 1.2.0 in EKS, and it's the same behavior described in this issue. Furthermore, I'm using Karpenter 1.0 with two NodePools configured: one for Airbyte's components and the other for the jobs (sync, check, discover, and spec).

> Also, do you observe this sync hanging issue frequently? I have seen some activity around the cron recently; there should be a few fixes in the next version.

Yes, this was an everyday issue until I figured out how to set the resources for the non-sync jobs through environment variables (unfortunately, not documented). After that, although the issue continued to happen, the frequency dropped significantly; it now happens 1-2 times per week, depending on Karpenter's disruption actions.

@gosusnp
Contributor

gosusnp commented Mar 21, 2025

@hvignolo87, I have been over our recent commit history; I believe the fix to enable the missing component was merged in January and is part of version 1.4.
I'd recommend updating to the latest.
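
If you deployed with the Helm chart, the upgrade is roughly the sketch below; the release name airbyte, the namespace, and the target chart version are assumptions here, so adjust them to your setup and keep your existing values file:

# add/refresh the official Airbyte chart repo and list available chart versions
helm repo add airbyte https://airbytehq.github.io/helm-charts
helm repo update
helm search repo airbyte/airbyte --versions
# upgrade the existing release in place, reusing your current values (example version shown)
helm upgrade airbyte airbyte/airbyte -n airbyte -f values.yaml --version 1.4.0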

@hvignolo87

hvignolo87 commented Mar 21, 2025

It seems like it's a private repo, because that link doesn't work for me 😓

I've been working to better understand the problem and found a workaround. I'll share it here in case it's useful to someone else who is in my situation, where I don't have enough time to update the platform version at the moment.

1. Set the resources for non-sync jobs (if you're using Karpenter like me)

Add these env vars in workload-launcher.extraEnv[] (modify as needed):

extraEnv:
  - name: SPEC_JOB_MAIN_CONTAINER_CPU_REQUEST
    value: "200m"
  - name: SPEC_JOB_MAIN_CONTAINER_MEMORY_REQUEST
    value: "512Mi"
  - name: CHECK_JOB_MAIN_CONTAINER_CPU_REQUEST
    value: "200m"
  - name: CHECK_JOB_MAIN_CONTAINER_MEMORY_REQUEST
    value: "512Mi"
  - name: DISCOVER_JOB_MAIN_CONTAINER_CPU_REQUEST
    value: "200m"
  - name: DISCOVER_JOB_MAIN_CONTAINER_MEMORY_REQUEST
    value: "512Mi"
2. Create a ConfigMap to override the default Temporal configuration

Searching for open issues related to this one, I found this comment, which was very useful.

Create a custom configMap with these parameters and apply it manually using kubectl apply -f ...

apiVersion: v1
kind: ConfigMap
metadata: # Adjust these as needed
  name: airbyte-platform-temporal-dynamicconfig-override
  namespace: airbyte-prod
data:
  "development.yaml": |
    frontend.namespaceCount:
      - value: 4096
        constraints: {}
    frontend.namespaceRPS.visibility:
      - value: 100
        constraints: {}
    frontend.namespaceBurst.visibility:
      - value: 150
        constraints: {}
    frontend.namespaceRPS:
      - value: 76800
        constraints: {}
    frontend.enableClientVersionCheck:
      - value: true
        constraints: {}
    frontend.persistenceMaxQPS:
      - value: 5000
        constraints: {}
    frontend.throttledLogRPS:
      - value: 200
        constraints: {}
    frontend.enableUpdateWorkflowExecution:
      - value: true
    frontend.enableUpdateWorkflowExecutionAsyncAccepted:
      - value: true
    history.historyMgrNumConns:
      - value: 50
        constraints: {}
    system.advancedVisibilityWritingMode:
      - value: "off"
        constraints: {}
    history.defaultActivityRetryPolicy:
      - value:
          InitialIntervalInSeconds: 1
          MaximumIntervalCoefficient: 100.0
          BackoffCoefficient: 2.0
          MaximumAttempts: 0
    history.defaultWorkflowRetryPolicy:
      - value:
          InitialIntervalInSeconds: 1
          MaximumIntervalCoefficient: 100.0
          BackoffCoefficient: 2.0
          MaximumAttempts: 0
    # Limit for responses. This mostly impacts discovery jobs since they have the largest responses.
    limit.blobSize.error:
      - value: 15728640 # 15MB
        constraints: {}
    limit.blobSize.warn:
      - value: 10485760 # 10MB
        constraints: {}
3. Override the config in the temporal pod

Override the DYNAMIC_CONFIG_FILE_PATH env var, and mount the new volume with the data extracted from the configMap.

temporal:
  # -- Additional env vars for temporal pod(s).
  extraEnv:
    - name: DYNAMIC_CONFIG_FILE_PATH
      value: config/dynamicconfig-override/development.yaml # The path to override/patch

  # -- Additional volumeMounts for temporal containers
  extraVolumeMounts:
    - name: airbyte-temporal-dynamicconfig-override
      mountPath: "/etc/temporal/config/dynamicconfig-override/"

  # -- Additional volumes for temporal pods
  extraVolumes:
    - name: airbyte-temporal-dynamicconfig-override
      configMap:
        name: airbyte-platform-temporal-dynamicconfig-override # The configMap created in step 2
        items:
          - key: development.yaml
            path: development.yaml

After creating these resources and deploying the changes, it's working like a charm 🔧
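
A quick way to verify that the override is actually picked up, assuming the Temporal Deployment is named airbyte-temporal (the namespace and mount path come from the manifests above):

# confirm the env var points at the override file
kubectl -n airbyte-prod exec deploy/airbyte-temporal -- env | grep DYNAMIC_CONFIG_FILE_PATH
# confirm the ConfigMap content is mounted where Temporal expects it
kubectl -n airbyte-prod exec deploy/airbyte-temporal -- cat /etc/temporal/config/dynamicconfig-override/development.yaml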

@gosusnp
Contributor

gosusnp commented Mar 21, 2025

Correct link is airbytehq/airbyte-platform@6b61cf1

@hvignolo87

> Correct link is airbytehq/airbyte-platform@6b61cf1

I'm not sure I understand why that change fixes this problem 🤔

@gosusnp
Contributor

gosusnp commented Mar 22, 2025

The component in the airbyte-cron that enforces the timeout wasn't enabled properly. The PR I linked addresses the main issue here.

Heads up, I don't think the configuration overrides you uncovered are meant to stay in the workload-launcher.
I am also not sure how the Temporal configuration override relates to this issue.
