Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CELEBORN-1912] ReadClientHandler should support reconnection for ChannelInboundHandler#channelInactive #3154

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

SteNicholas
Copy link
Member

@SteNicholas SteNicholas commented Mar 14, 2025

What changes were proposed in this pull request?

ReadClientHandler should support reconnection for ChannelInboundHandler#channelInactive.

Why are the changes needed?

ReadClientHandler should support reconnection for ChannelInboundHandler#channelInactive to avoid frequent failover because of the below exception:

2025-03-10 15:20:54
java.io.IOException: Client /11.35.44.216:9093 is lost, notify related stream 152545797850
	at org.apache.celeborn.plugin.flink.RemoteBufferStreamReader.errorReceived(RemoteBufferStreamReader.java:146)
	at org.apache.celeborn.plugin.flink.RemoteBufferStreamReader.lambda$new$0(RemoteBufferStreamReader.java:77)
	at org.apache.celeborn.plugin.flink.network.ReadClientHandler.processMessageInternal(ReadClientHandler.java:64)
	at org.apache.celeborn.plugin.flink.network.ReadClientHandler.lambda$channelInactive$0(ReadClientHandler.java:145)
	at java.base/java.util.concurrent.ConcurrentHashMap.forEach(ConcurrentHashMap.java:1603)
	at org.apache.celeborn.plugin.flink.network.ReadClientHandler.channelInactive(ReadClientHandler.java:136)
	at org.apache.celeborn.common.network.server.TransportRequestHandler.channelInactive(TransportRequestHandler.java:74)
	at org.apache.celeborn.common.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:141)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:303)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)
	at org.apache.celeborn.shaded.io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
	at org.apache.celeborn.shaded.io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:280)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:303)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)
	at org.apache.celeborn.shaded.io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
	at org.apache.celeborn.plugin.flink.network.TransportFrameDecoderWithBufferSupplier.channelInactive(TransportFrameDecoderWithBufferSupplier.java:207)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:303)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)
	at org.apache.celeborn.shaded.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:301)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
	at org.apache.celeborn.shaded.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:813)
	at org.apache.celeborn.shaded.io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
	at org.apache.celeborn.shaded.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
	at org.apache.celeborn.shaded.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
	at org.apache.celeborn.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:566)
	at org.apache.celeborn.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at org.apache.celeborn.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at org.apache.celeborn.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:991)

Does this PR introduce any user-facing change?

Introduce celeborn.<module>.reconnect.maxRetries and celeborn.<module>.reconnect.retryWait to support reconnection of client.

How was this patch tested?

TransportClientFactorySuiteJ#testChannelInactiveReconnect

@SteNicholas SteNicholas force-pushed the CELEBORN-1912 branch 4 times, most recently from c67475f to 12867fe Compare March 17, 2025 03:04
@SteNicholas SteNicholas marked this pull request as draft March 17, 2025 03:17
@SteNicholas SteNicholas force-pushed the CELEBORN-1912 branch 4 times, most recently from d0416a1 to 0f1d24c Compare March 17, 2025 03:45
@SteNicholas SteNicholas marked this pull request as ready for review March 17, 2025 03:50
@SteNicholas
Copy link
Member Author

Ping @reswqa, @codenohup.

@SteNicholas SteNicholas requested a review from reswqa March 17, 2025 08:02
@SteNicholas SteNicholas requested review from FMX and RexXiong March 17, 2025 08:26
@FMX
Copy link
Contributor

FMX commented Mar 17, 2025

Although you can reconnect the connection, there might be something wrong with your data because rebuilding the connection may cause wrong data order.

@FMX
Copy link
Contributor

FMX commented Mar 17, 2025

Although you can reconnect the connection, there might be something wrong with your data because rebuilding the connection may cause wrong data order.

Forget about this, I found that you have rebuilt the connection used to read shuffle from Celeborn which shall be OK.

@SteNicholas SteNicholas requested a review from FMX March 18, 2025 08:55
@SteNicholas SteNicholas force-pushed the CELEBORN-1912 branch 2 times, most recently from 0ce14a2 to 3c9336e Compare March 18, 2025 13:09
Copy link
Contributor

@FMX FMX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@SteNicholas SteNicholas marked this pull request as draft March 19, 2025 08:06
@SteNicholas SteNicholas marked this pull request as ready for review March 19, 2025 11:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants