Sync attempt with killed pod hangs eternally #48879

Open
henriquemeloo opened this issue Dec 10, 2024 · 12 comments
Labels: area/platform (issues related to the platform), community, team/platform-move, type/bug (Something isn't working)

Comments

@henriquemeloo
Contributor

Helm Chart Version

1.2.0

What step the error happened?

During the Sync

Relevant information

When the pod for a sync gets killed externally (e.g. by cluster downsizing or by running nodes on EC2 spot instances), the sync just hangs without starting a new attempt or failing. I would expect the heartbeat mechanism (which I did not intentionally configure either on or off) to flag this attempt as failed, but the sync's logs just freeze and the sync is still shown as running in the webapp. I couldn't find any logs related to this specific job ID in the other pods. What I did find were these (possibly unrelated) logs from the server deployment:

Relevant log output

2024-12-10 13:53:15,461 [io-executor-thread-2]	WARN	i.a.c.s.c.JobConverter(getWorkspaceId):403 - Unable to retrieve workspace ID for job null.
java.lang.NullPointerException: null
    at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:904)
    at com.google.common.cache.LocalCache.get(LocalCache.java:4016)
    at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4040)
    at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4989)
    at io.airbyte.persistence.job.WorkspaceHelper.lambda$getWorkspaceForJobId$4(WorkspaceHelper.java:162)
    at io.airbyte.persistence.job.WorkspaceHelper.handleCacheExceptions(WorkspaceHelper.java:231)
    at io.airbyte.persistence.job.WorkspaceHelper.getWorkspaceForJobId(WorkspaceHelper.java:162)
    at io.airbyte.commons.server.converters.JobConverter.getWorkspaceId(JobConverter.java:401)
    at io.airbyte.commons.server.converters.JobConverter.getAttemptLogs(JobConverter.java:320)
    at io.airbyte.commons.server.converters.JobConverter.getSynchronousJobRead(JobConverter.java:349)
    at io.airbyte.commons.server.converters.JobConverter.getSynchronousJobRead(JobConverter.java:344)
    at io.airbyte.commons.server.handlers.SchedulerHandler.retrieveDiscoveredSchema(SchedulerHandler.java:567)
    at io.airbyte.commons.server.handlers.SchedulerHandler.discoverAndGloballyDisable(SchedulerHandler.java:400)
    at io.airbyte.commons.server.handlers.SchedulerHandler.discoverSchemaForSourceFromSourceId(SchedulerHandler.java:365)
    at io.airbyte.commons.server.handlers.WebBackendConnectionsHandler.getRefreshedSchema(WebBackendConnectionsHandler.java:488)
    at io.airbyte.commons.server.handlers.WebBackendConnectionsHandler.webBackendGetConnection(WebBackendConnectionsHandler.java:432)
    at io.airbyte.server.apis.WebBackendApiController.lambda$webBackendGetConnection$2(WebBackendApiController.java:113)
    at io.airbyte.server.apis.ApiHelper.execute(ApiHelper.kt:29)
    at io.airbyte.server.apis.WebBackendApiController.webBackendGetConnection(WebBackendApiController.java:103)
    at io.airbyte.server.apis.$WebBackendApiController$Definition$Exec.dispatch(Unknown Source)
    at io.micronaut.context.AbstractExecutableMethodsDefinition$DispatchedExecutableMethod.invokeUnsafe(AbstractExecutableMethodsDefinition.java:461)
    at io.micronaut.context.DefaultBeanContext$BeanContextUnsafeExecutionHandle.invokeUnsafe(DefaultBeanContext.java:4350)
    at io.micronaut.web.router.AbstractRouteMatch.execute(AbstractRouteMatch.java:272)
    at io.micronaut.web.router.DefaultUriRouteMatch.execute(DefaultUriRouteMatch.java:38)
    at io.micronaut.http.server.RouteExecutor.executeRouteAndConvertBody(RouteExecutor.java:498)
    at io.micronaut.http.server.RouteExecutor.lambda$callRoute$5(RouteExecutor.java:475)
    at io.micronaut.core.execution.ExecutionFlow.lambda$async$1(ExecutionFlow.java:87)
    at io.micronaut.core.propagation.PropagatedContext.lambda$wrap$3(PropagatedContext.java:211)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1583)
@henriquemeloo
Contributor Author

henriquemeloo commented Dec 10, 2024

The cron service might be dead though. From the cron I get these logs:

2024-12-10 13:31:52,128 [scheduled-executor-thread-4]	ERROR	i.m.s.DefaultTaskExceptionHandler(handle):47 - Error invoking scheduled task for bean [io.airbyte.cron.jobs.SelfHealTemporalWorkflows@757950e4] RESOURCE_EXHAUSTED: namespace rate limit exceeded
io.grpc.StatusRuntimeException: RESOURCE_EXHAUSTED: namespace rate limit exceeded
    at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:268)
    at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:249)
    at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:167)
    at io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.listClosedWorkflowExecutions(WorkflowServiceGrpc.java:4620)
    at io.airbyte.commons.temporal.WorkflowServiceStubsWrapped.lambda$blockingStubListClosedWorkflowExecutions$0(WorkflowServiceStubsWrapped.java:47)
    at dev.failsafe.Functions.lambda$toCtxSupplier$11(Functions.java:243)
    at dev.failsafe.Functions.lambda$get$0(Functions.java:46)
    at dev.failsafe.internal.RetryPolicyExecutor.lambda$apply$0(RetryPolicyExecutor.java:74)
    at dev.failsafe.SyncExecutionImpl.executeSync(SyncExecutionImpl.java:187)
    at dev.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:376)
    at dev.failsafe.FailsafeExecutor.get(FailsafeExecutor.java:112)
    at io.airbyte.commons.temporal.RetryHelper.withRetries(RetryHelper.java:60)
    at io.airbyte.commons.temporal.WorkflowServiceStubsWrapped.withRetries(WorkflowServiceStubsWrapped.java:67)
    at io.airbyte.commons.temporal.WorkflowServiceStubsWrapped.blockingStubListClosedWorkflowExecutions(WorkflowServiceStubsWrapped.java:47)
    at io.airbyte.commons.temporal.TemporalClient.fetchClosedWorkflowsByStatus(TemporalClient.java:139)
    at io.airbyte.commons.temporal.TemporalClient.restartClosedWorkflowByStatus(TemporalClient.java:117)
    at io.airbyte.cron.jobs.SelfHealTemporalWorkflows.cleanTemporal(SelfHealTemporalWorkflows.java:42)
    at io.airbyte.cron.jobs.$SelfHealTemporalWorkflows$Definition$Exec.dispatch(Unknown Source)
    at io.micronaut.context.AbstractExecutableMethodsDefinition$DispatchedExecutableMethod.invoke(AbstractExecutableMethodsDefinition.java:456)
    at io.micronaut.inject.DelegatingExecutableMethod.invoke(DelegatingExecutableMethod.java:86)
    at io.micronaut.context.bind.DefaultExecutableBeanContextBinder$ContextBoundExecutable.invoke(DefaultExecutableBeanContextBinder.java:152)
    at io.micronaut.scheduling.processor.ScheduledMethodProcessor.lambda$scheduleTask$2(ScheduledMethodProcessor.java:160)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
    at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:358)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1583)

@marcosmarxm
Member

@airbytehq/move-platform can someone take a look at this issue?

@hvignolo87

I'm facing the same situation. Is there any news related to this issue?

@gosusnp
Contributor

gosusnp commented Mar 17, 2025

@henriquemeloo, your assessment seems right. In the event that a job pod disappears (hard failure or the node getting taken away), there is a timeout related to the heartbeat that should catch it.
The cron is the application enforcing this. One way to confirm this from the logs would be to look for "Checking for non heartbeating workload".

@hvignolo87 , same answer, can you confirm if the cron is running?
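
For reference, a minimal way to check for that message from outside the pod, assuming the cron runs as a Deployment named airbyte-cron in the namespace where Airbyte is installed (adjust names to your release):

kubectl -n airbyte logs deploy/airbyte-cron | grep "Checking for non heartbeating workload"

If that line never shows up in the cron logs, the monitor that fails non-heartbeating workloads is likely not running.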

@hvignolo87

> @henriquemeloo, your assessment seems right. In the event that a job pod disappears (hard failure or the node getting taken away), there is a timeout related to the heartbeat that should catch it. The cron is the application enforcing this. One way to confirm this from the logs would be to look for "Checking for non heartbeating workload".
>
> @hvignolo87 , same answer, can you confirm if the cron is running?

Hi @gosusnp! Thanks for your reply!

The cron is up and running, and there's no match for "Checking for non heartbeating workload" in its logs.

@gosusnp
Contributor

gosusnp commented Mar 19, 2025

@hvignolo87, are you running 1.2.0 as well? If not, which version are you running?

Also, do you observe this sync hanging issue frequently? I have seen some activity around the cron recently; there should be a few fixes in the next version.

@hvignolo87

> @hvignolo87, are you running 1.2.0 as well? If not, which version are you running?

Yes, I'm running 1.2.0 in EKS, and it's the same behavior described in this issue. Furthermore, I'm using Karpenter 1.0 with two NodePools configured: one for Airbyte's components and the other for the jobs (sync, check, discover, and spec).

> Also, do you observe this sync hanging issue frequently? I have seen some activity around the cron recently; there should be a few fixes in the next version.

Yes, this was an everyday issue until I figured out how to set the resources for the non-sync jobs through environment variables (unfortunately, not documented). After that, although the issue continued to happen, the frequency dropped significantly; it now happens 1-2 times per week, depending on Karpenter's disruption actions.

@gosusnp
Contributor

gosusnp commented Mar 21, 2025

@hvignolo87, I have been over our recent commit history; I believe the fix to enable the missing component was merged in January and is part of version 1.4.
I'd recommend updating to the latest.
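
If you deployed with the Helm chart, the upgrade is roughly the sketch below; the release name airbyte, the namespace, and the target chart version are assumptions here, so adjust them to your setup and keep your existing values file:

# add/refresh the official Airbyte chart repo and list available chart versions
helm repo add airbyte https://airbytehq.github.io/helm-charts
helm repo update
helm search repo airbyte/airbyte --versions
# upgrade the existing release in place, reusing your current values (example version shown)
helm upgrade airbyte airbyte/airbyte -n airbyte -f values.yaml --version 1.4.0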

@hvignolo87

hvignolo87 commented Mar 21, 2025

It seems like it's a private repo, because that link doesn't work for me 😓

I've been working to better understand the problem and found a workaround. I'll share it here in case it's useful to someone else who is in my situation, where I don't have enough time to update the platform version at the moment.

1. Set the resources for non-sync jobs (if you're using Karpenter like me)

Add these env vars in workload-launcher.extraEnv[] (modify as needed):

extraEnv:
  - name: SPEC_JOB_MAIN_CONTAINER_CPU_REQUEST
    value: "200m"
  - name: SPEC_JOB_MAIN_CONTAINER_MEMORY_REQUEST
    value: "512Mi"
  - name: CHECK_JOB_MAIN_CONTAINER_CPU_REQUEST
    value: "200m"
  - name: CHECK_JOB_MAIN_CONTAINER_MEMORY_REQUEST
    value: "512Mi"
  - name: DISCOVER_JOB_MAIN_CONTAINER_CPU_REQUEST
    value: "200m"
  - name: DISCOVER_JOB_MAIN_CONTAINER_MEMORY_REQUEST
    value: "512Mi"
2. Create a ConfigMap to override the default Temporal configuration

Searching for open issues related to this one, I found this comment, which was very useful.

Create a custom configMap with these parameters and apply it manually using kubectl apply -f ...

apiVersion: v1
kind: ConfigMap
metadata: # Adjust these as needed
  name: airbyte-platform-temporal-dynamicconfig-override
  namespace: airbyte-prod
data:
  "development.yaml": |
    frontend.namespaceCount:
      - value: 4096
        constraints: {}
    frontend.namespaceRPS.visibility:
      - value: 100
        constraints: {}
    frontend.namespaceBurst.visibility:
      - value: 150
        constraints: {}
    frontend.namespaceRPS:
      - value: 76800
        constraints: {}
    frontend.enableClientVersionCheck:
      - value: true
        constraints: {}
    frontend.persistenceMaxQPS:
      - value: 5000
        constraints: {}
    frontend.throttledLogRPS:
      - value: 200
        constraints: {}
    frontend.enableUpdateWorkflowExecution:
      - value: true
    frontend.enableUpdateWorkflowExecutionAsyncAccepted:
      - value: true
    history.historyMgrNumConns:
      - value: 50
        constraints: {}
    system.advancedVisibilityWritingMode:
      - value: "off"
        constraints: {}
    history.defaultActivityRetryPolicy:
      - value:
          InitialIntervalInSeconds: 1
          MaximumIntervalCoefficient: 100.0
          BackoffCoefficient: 2.0
          MaximumAttempts: 0
    history.defaultWorkflowRetryPolicy:
      - value:
          InitialIntervalInSeconds: 1
          MaximumIntervalCoefficient: 100.0
          BackoffCoefficient: 2.0
          MaximumAttempts: 0
    # Limit for responses. This mostly impacts discovery jobs since they have the largest responses.
    limit.blobSize.error:
      - value: 15728640 # 15MB
        constraints: {}
    limit.blobSize.warn:
      - value: 10485760 # 10MB
        constraints: {}
3. Override the config in the temporal pod

Override the DYNAMIC_CONFIG_FILE_PATH env var, and mount the new volume with the data extracted from the configMap.

temporal:
  # -- Additional env vars for temporal pod(s).
  extraEnv:
    - name: DYNAMIC_CONFIG_FILE_PATH
      value: config/dynamicconfig-override/development.yaml # The path to override/patch

  # -- Additional volumeMounts for temporal containers
  extraVolumeMounts:
    - name: airbyte-temporal-dynamicconfig-override
      mountPath: "/etc/temporal/config/dynamicconfig-override/"

  # -- Additional volumes for temporal pods
  extraVolumes:
    - name: airbyte-temporal-dynamicconfig-override
      configMap:
        name: airbyte-platform-temporal-dynamicconfig-override # The configMap created in step 2
        items:
          - key: development.yaml
            path: development.yaml

After creating these resources and deploying the changes, it's working like a charm 🔧
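
A quick way to verify that the override is actually picked up, assuming the Temporal Deployment is named airbyte-temporal (the namespace and mount path come from the manifests above):

# confirm the env var points at the override file
kubectl -n airbyte-prod exec deploy/airbyte-temporal -- env | grep DYNAMIC_CONFIG_FILE_PATH
# confirm the ConfigMap content is mounted where Temporal expects it
kubectl -n airbyte-prod exec deploy/airbyte-temporal -- cat /etc/temporal/config/dynamicconfig-override/development.yaml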

@gosusnp
Contributor

gosusnp commented Mar 21, 2025

Correct link is airbytehq/airbyte-platform@6b61cf1

@hvignolo87

> Correct link is airbytehq/airbyte-platform@6b61cf1

I'm not sure I understand why that change fixes this problem 🤔

@gosusnp
Contributor

gosusnp commented Mar 22, 2025

The component in the airbyte-cron that enforces the timeout wasn't enabled properly. The PR I linked addresses the main issue here.

Heads up, I don't think the configuration overrides you uncovered are meant to stay in the workload-launcher.
I am also not sure how the Temporal configuration override relates to this issue.
