You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Overlord cannot respond to requests when kafka is abnormal
Affected Version
28.0.1
Description
Please include as much detailed information about the problem as possible.
Cluster size
3-node cluster
Configurations in use
cluster configuration
Steps to reproduce the problem
Stop one or two kafka nodes in the Kafka cluster and stop or start a large number of supervisors
The error message or stack traces encountered. Providing more context, such as nearby log messages or even entire logs, can be helpful.
Overlord experienced a large number of connection timeouts
"qtp1009260571-132" #132 daemon prio=5 os_prio=0 cpu=110.66ms elapsed=14488.50s tid=0x000055cf4f39c800 nid=0x1a96a7 waiting for monitor entry [0x00007f5e36c12000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor.stop(SeekableStreamSupervisor.java:933)
- waiting to lock <0x00000000eb35fae0> (a java.lang.Object)
at org.apache.druid.indexing.overlord.supervisor.SupervisorManager.possiblyStopAndRemoveSupervisorInternal(SupervisorManager.java:268)
at org.apache.druid.indexing.overlord.supervisor.SupervisorManager.stopAndRemoveSupervisor(SupervisorManager.java:105)
- locked <0x00000000f56ddbd8> (a java.lang.Object)
at org.apache.druid.indexing.overlord.supervisor.SupervisorResource.lambda$terminate$9(SupervisorResource.java:348)
at org.apache.druid.indexing.overlord.supervisor.SupervisorResource$$Lambda$731/574536258.apply(Unknown Source)
at org.apache.druid.indexing.overlord.supervisor.SupervisorResource.asLeaderWithSupervisorManager(SupervisorResource.java:476)
at org.apache.druid.indexing.overlord.supervisor.SupervisorResource.terminate(SupervisorResource.java:346)
at org.apache.druid.indexing.overlord.supervisor.SupervisorResource.shutdown(SupervisorResource.java:337)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205)
at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542)
at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473)
at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419)
at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409)
at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
at com.google.inject.servlet.ServletDefinition.doServiceImpl(ServletDefinition.java:286)
at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:276)
at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:181)
at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:85)
at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:120)
at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:135)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
at org.apache.druid.server.http.RedirectFilter.doFilter(RedirectFilter.java:73)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
at org.apache.druid.server.security.PreResponseAuthorizationCheckFilter.doFilter(PreResponseAuthorizationCheckFilter.java:82)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
at org.apache.druid.server.initialization.jetty.StandardResponseHeaderFilterHolder$StandardResponseHeaderFilter.doFilter(StandardResponseHeaderFilterHolder.java:161)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
at org.apache.druid.server.security.AllowHttpMethodsResourceFilter.doFilter(AllowHttpMethodsResourceFilter.java:78)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
at org.apache.druid.server.security.AllowOptionsResourceFilter.doFilter(AllowOptionsResourceFilter.java:75)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
at org.apache.druid.server.security.AllowAllAuthenticator$1.doFilter(AllowAllAuthenticator.java:84)
at org.apache.druid.server.security.AuthenticationWrappingFilter.doFilter(AuthenticationWrappingFilter.java:59)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
at org.apache.druid.server.security.SecuritySanityCheckFilter.doFilter(SecuritySanityCheckFilter.java:77)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:552)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1624)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:505)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1594)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:772)
at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:59)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.eclipse.jetty.server.Server.handle(Server.java:516)
at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487)
at org.eclipse.jetty.server.HttpChannel$$Lambda$349/207742670.dispatch(Unknown Source)
at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
at org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.onFillable(SslConnection.java:555)
at org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:410)
at org.eclipse.jetty.io.ssl.SslConnection$2.succeeded(SslConnection.java:164)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
at java.lang.Thread.run(Thread.java:750)
行 627: - waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
行 756: - waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
行 886: - waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
行 982: - waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
行 1100: - waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
行 1196: - waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
行 1291: - waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
行 1410: - waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
行 1505: - waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
行 1622: - waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
行 1721: - locked <0x00000000f56ddbd8> (a java.lang.Object)
行 1842: - waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
行 1938: - waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
行 2034: - waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
行 2151: - waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
行 2283: - waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
行 2423: - waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
行 2519: - waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
行 2626: - waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
行 2733: - waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
行 2839: - waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
行 2959: - waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
行 3054: - waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
行 3172: - waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
行 3300: - waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
行 3406: - waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
行 3502: - waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
Any debugging that you have already done
This problem occurs when there are three nodes in a Kafka cluster, but one or two of them are unavailable.
There are a large number of timeout logs in the overlord.log as follows:
org.apache.druid.indexing.seekablestream.common.StreamException: org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before the position for partition ODAEDATASET._DEFAULT.xxxxxx._DEFAULT-0 could be determined
at org.apache.druid.indexing.kafka.KafkaRecordSupplier.wrapExceptions(KafkaRecordSupplier.java:328) ~[?:?]
at org.apache.druid.indexing.kafka.KafkaRecordSupplier.getPosition(KafkaRecordSupplier.java:182) ~[?:?]
at org.apache.druid.indexing.kafka.KafkaRecordSupplier.getEarliestSequenceNumber(KafkaRecordSupplier.java:164) ~[?:?]
at org.apache.druid.indexing.kafka.KafkaRecordSupplier.isOffsetAvailable(KafkaRecordSupplier.java:174) ~[?:?]
at org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor.checkOffsetAvailability(SeekableStreamSupervisor.java:4114) ~[druid-indexing-service-24.0.1-htrunk17.jar:?]
at org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor.getOffsetFromStorageForPartition(SeekableStreamSupervisor.java:3570) ~[druid-indexing-service-24.0.1-htrunk17.jar:?]
at org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor.generateStartingSequencesForPartitionGroup(SeekableStreamSupervisor.java:3548) ~[druid-indexing-service-24.0.1-htrunk17.jar:?]
at org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor.createNewTasks(SeekableStreamSupervisor.java:3357) ~[druid-indexing-service-24.0.1-htrunk17.jar:?]
at org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor.runInternal(SeekableStreamSupervisor.java:1490) ~[druid-indexing-service-24.0.1-htrunk17.jar:?]
at org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor$RunNotice.handle(SeekableStreamSupervisor.java:403) ~[druid-indexing-service-24.0.1-htrunk17.jar:?]
at org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor.lambda$tryInit$3(SeekableStreamSupervisor.java:1049) ~[druid-indexing-service-24.0.1-htrunk17.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_382]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_382]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_382]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_382]
at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_382]
The text was updated successfully, but these errors were encountered:
This issue has been marked as stale due to 280 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If this issue is still
relevant, please simply write any comment. Even if closed, you can still revive the
issue at any time or discuss it on the [email protected] list.
Thank you for your contributions.
This issue has been closed due to lack of activity. If you think that
is incorrect, or the issue requires additional review, you can revive the issue at
any time.
Overlord cannot respond to requests when kafka is abnormal
Affected Version
28.0.1
Description
Please include as much detailed information about the problem as possible.
Cluster size
3-node cluster
Configurations in use
cluster configuration
Steps to reproduce the problem
Stop one or two kafka nodes in the Kafka cluster and stop or start a large number of supervisors
The error message or stack traces encountered. Providing more context, such as nearby log messages or even entire logs, can be helpful.

Overlord experienced a large number of connection timeouts
This problem occurs when there are three nodes in a Kafka cluster, but one or two of them are unavailable.
There are a large number of timeout logs in the overlord.log as follows:
The text was updated successfully, but these errors were encountered: