Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overlord cannot respond to requests when kafka is abnormal #16272

Closed
soullkk opened this issue Apr 12, 2024 · 2 comments
Closed

Overlord cannot respond to requests when kafka is abnormal #16272

soullkk opened this issue Apr 12, 2024 · 2 comments

Comments

@soullkk
Copy link
Contributor

soullkk commented Apr 12, 2024

Overlord cannot respond to requests when kafka is abnormal

Affected Version

28.0.1

Description

Please include as much detailed information about the problem as possible.

  • Cluster size
    3-node cluster

  • Configurations in use
    cluster configuration

  • Steps to reproduce the problem
    Stop one or two kafka nodes in the Kafka cluster and stop or start a large number of supervisors

  • The error message or stack traces encountered. Providing more context, such as nearby log messages or even entire logs, can be helpful.
    Overlord experienced a large number of connection timeouts
    image

"qtp1009260571-132" #132 daemon prio=5 os_prio=0 cpu=110.66ms elapsed=14488.50s tid=0x000055cf4f39c800 nid=0x1a96a7 waiting for monitor entry [0x00007f5e36c12000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor.stop(SeekableStreamSupervisor.java:933)
	- waiting to lock <0x00000000eb35fae0> (a java.lang.Object)
	at org.apache.druid.indexing.overlord.supervisor.SupervisorManager.possiblyStopAndRemoveSupervisorInternal(SupervisorManager.java:268)
	at org.apache.druid.indexing.overlord.supervisor.SupervisorManager.stopAndRemoveSupervisor(SupervisorManager.java:105)
	- locked <0x00000000f56ddbd8> (a java.lang.Object)
	at org.apache.druid.indexing.overlord.supervisor.SupervisorResource.lambda$terminate$9(SupervisorResource.java:348)
	at org.apache.druid.indexing.overlord.supervisor.SupervisorResource$$Lambda$731/574536258.apply(Unknown Source)
	at org.apache.druid.indexing.overlord.supervisor.SupervisorResource.asLeaderWithSupervisorManager(SupervisorResource.java:476)
	at org.apache.druid.indexing.overlord.supervisor.SupervisorResource.terminate(SupervisorResource.java:346)
	at org.apache.druid.indexing.overlord.supervisor.SupervisorResource.shutdown(SupervisorResource.java:337)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
	at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205)
	at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
	at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
	at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
	at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
	at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
	at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
	at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542)
	at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473)
	at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419)
	at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409)
	at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
	at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
	at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
	at com.google.inject.servlet.ServletDefinition.doServiceImpl(ServletDefinition.java:286)
	at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:276)
	at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:181)
	at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
	at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:85)
	at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:120)
	at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:135)
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
	at org.apache.druid.server.http.RedirectFilter.doFilter(RedirectFilter.java:73)
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
	at org.apache.druid.server.security.PreResponseAuthorizationCheckFilter.doFilter(PreResponseAuthorizationCheckFilter.java:82)
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
	at org.apache.druid.server.initialization.jetty.StandardResponseHeaderFilterHolder$StandardResponseHeaderFilter.doFilter(StandardResponseHeaderFilterHolder.java:161)
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
	at org.apache.druid.server.security.AllowHttpMethodsResourceFilter.doFilter(AllowHttpMethodsResourceFilter.java:78)
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
	at org.apache.druid.server.security.AllowOptionsResourceFilter.doFilter(AllowOptionsResourceFilter.java:75)
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
	at org.apache.druid.server.security.AllowAllAuthenticator$1.doFilter(AllowAllAuthenticator.java:84)
	at org.apache.druid.server.security.AuthenticationWrappingFilter.doFilter(AuthenticationWrappingFilter.java:59)
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
	at org.apache.druid.server.security.SecuritySanityCheckFilter.doFilter(SecuritySanityCheckFilter.java:77)
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:552)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1624)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:505)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1594)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:772)
	at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:59)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
	at org.eclipse.jetty.server.Server.handle(Server.java:516)
	at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487)
	at org.eclipse.jetty.server.HttpChannel$$Lambda$349/207742670.dispatch(Unknown Source)
	at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
	at org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.onFillable(SslConnection.java:555)
	at org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:410)
	at org.eclipse.jetty.io.ssl.SslConnection$2.succeeded(SslConnection.java:164)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
	at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
	at java.lang.Thread.run(Thread.java:750)
	行 627: 	- waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
	行 756: 	- waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
	行 886: 	- waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
	行 982: 	- waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
	行 1100: 	- waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
	行 1196: 	- waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
	行 1291: 	- waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
	行 1410: 	- waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
	行 1505: 	- waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
	行 1622: 	- waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
	行 1721: 	- locked <0x00000000f56ddbd8> (a java.lang.Object)
	行 1842: 	- waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
	行 1938: 	- waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
	行 2034: 	- waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
	行 2151: 	- waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
	行 2283: 	- waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
	行 2423: 	- waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
	行 2519: 	- waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
	行 2626: 	- waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
	行 2733: 	- waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
	行 2839: 	- waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
	行 2959: 	- waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
	行 3054: 	- waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
	行 3172: 	- waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
	行 3300: 	- waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
	行 3406: 	- waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
	行 3502: 	- waiting to lock <0x00000000f56ddbd8> (a java.lang.Object)
  • Any debugging that you have already done
    This problem occurs when there are three nodes in a Kafka cluster, but one or two of them are unavailable.
    There are a large number of timeout logs in the overlord.log as follows:
org.apache.druid.indexing.seekablestream.common.StreamException: org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before the position for partition ODAEDATASET._DEFAULT.xxxxxx._DEFAULT-0 could be determined
	at org.apache.druid.indexing.kafka.KafkaRecordSupplier.wrapExceptions(KafkaRecordSupplier.java:328) ~[?:?]
	at org.apache.druid.indexing.kafka.KafkaRecordSupplier.getPosition(KafkaRecordSupplier.java:182) ~[?:?]
	at org.apache.druid.indexing.kafka.KafkaRecordSupplier.getEarliestSequenceNumber(KafkaRecordSupplier.java:164) ~[?:?]
	at org.apache.druid.indexing.kafka.KafkaRecordSupplier.isOffsetAvailable(KafkaRecordSupplier.java:174) ~[?:?]
	at org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor.checkOffsetAvailability(SeekableStreamSupervisor.java:4114) ~[druid-indexing-service-24.0.1-htrunk17.jar:?]
	at org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor.getOffsetFromStorageForPartition(SeekableStreamSupervisor.java:3570) ~[druid-indexing-service-24.0.1-htrunk17.jar:?]
	at org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor.generateStartingSequencesForPartitionGroup(SeekableStreamSupervisor.java:3548) ~[druid-indexing-service-24.0.1-htrunk17.jar:?]
	at org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor.createNewTasks(SeekableStreamSupervisor.java:3357) ~[druid-indexing-service-24.0.1-htrunk17.jar:?]
	at org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor.runInternal(SeekableStreamSupervisor.java:1490) ~[druid-indexing-service-24.0.1-htrunk17.jar:?]
	at org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor$RunNotice.handle(SeekableStreamSupervisor.java:403) ~[druid-indexing-service-24.0.1-htrunk17.jar:?]
	at org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor.lambda$tryInit$3(SeekableStreamSupervisor.java:1049) ~[druid-indexing-service-24.0.1-htrunk17.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_382]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_382]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_382]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_382]
	at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_382]
Copy link

This issue has been marked as stale due to 280 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If this issue is still
relevant, please simply write any comment. Even if closed, you can still revive the
issue at any time or discuss it on the [email protected] list.
Thank you for your contributions.

@github-actions github-actions bot added the stale label Jan 18, 2025
Copy link

This issue has been closed due to lack of activity. If you think that
is incorrect, or the issue requires additional review, you can revive the issue at
any time.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Feb 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant