v1.16.0 - Losing data when one node of the cluster is lost #1303
Comments
Hi there, I've debugged the issue a little more. I don't know the code well enough, but I think I've found where the problem is. In the sendMessages file, the responses of the brokers are tracked per broker (kafkajs/src/producer/sendMessages.js, lines 20 to 26 in 1876abc).
If the request is fulfilled, the response is set, but if the request fails, the response from that broker is deleted (kafkajs/src/producer/sendMessages.js, lines 116 to 121 in 1876abc).
This logic is fine when there is a problem with the request but the broker is still there. For the cases where the broker is lost (going from 3 brokers in the cluster to 2), the next retry only runs against the brokers that remain (kafkajs/src/producer/sendMessages.js, lines 65 to 69 in 1876abc).
So the responses that the producer finally resolves with (kafkajs/src/producer/sendMessages.js, lines 132 to 136 in 1876abc) no longer include the lost broker's partitions.
I think this logic should be based on partitions instead of brokers. Let me know your thoughts.
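A simplified sketch of the bookkeeping pattern described above (not the actual kafkajs source; names and shapes are illustrative):

```js
// Simplified sketch of the per-broker bookkeeping described above.
// This is NOT the actual kafkajs implementation; names are illustrative.
const responsePerBroker = new Map()

async function sendToBroker(broker, entry) {
  try {
    const response = await broker.produce(entry.request)
    // Success: remember what this broker acknowledged.
    responsePerBroker.set(broker, response)
  } catch (error) {
    // Failure: the broker's entry is removed. If the broker itself has
    // disappeared from the cluster, the next retry only iterates over the
    // brokers that still exist, so the removed entry's partitions are never
    // produced again and never surface as an error to the caller.
    responsePerBroker.delete(broker)
    throw error
  }
}
```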
This is clearly a bug.
So if your analysis is correct, then I suppose what's needed is to detect during retries that we still have outstanding partitions to produce to, check who the new leader for each of those partitions is, and then issue new requests. I'm not sure about the cluster behavior during this operation, though. If it's a graceful shutdown of a broker in the cluster, I would expect leadership to change to one of the other brokers, in which case the above approach should work; but if it's an ungraceful shutdown, I'm not sure this will make a difference, because the automatic leadership rebalance only kicks in after 5 minutes by default. Regardless, we should at least be able to throw a meaningful error when we can't find a leader for the partition, rather than silently fail as is the case currently.
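A rough sketch of that retry approach; the cluster helpers used here (refreshMetadata, findLeaderForPartition) are placeholders rather than the real kafkajs internals:

```js
// Sketch of the retry approach described above. The cluster helpers
// (refreshMetadata, findLeaderForPartition) are placeholders, not the
// actual kafkajs API.
async function retryOutstanding(cluster, outstanding) {
  // outstanding: [{ topic, partition, messages }] that no broker acknowledged
  await cluster.refreshMetadata()

  for (const { topic, partition, messages } of outstanding) {
    const leader = await cluster.findLeaderForPartition(topic, partition)
    if (!leader) {
      // Surface a meaningful error instead of silently dropping the batch.
      throw new Error(
        `No leader available for ${topic}[${partition}]; ${messages.length} messages were not produced`
      )
    }
    await leader.produce({ topic, partition, messages })
  }
}
```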
Hi @Nevon, as you say, the behaviour will be completely different depending on the server "shutdown style". Reporting per-record results is how some SDKs solve this kind of issue. For example, this is the response of the AWS SDK Kinesis Client or Firehose Client:
{
"FailedRecordCount": 2,
"Records": [
{
"SequenceNumber": "49543463076548007577105092703039560359975228518395012686",
"ShardId": "shardId-000000000000"
},
{
"ErrorCode": "ProvisionedThroughputExceededException",
"ErrorMessage": "Rate exceeded for shard shardId-000000000001 in stream exampleStreamName under account 111111111111."
},
{
"ErrorCode": "InternalFailure",
"ErrorMessage": "Internal service failure."
}
]
}
Of course, we must select the correct format to be sure that we don't break any previous implementation.
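Purely as an illustration of the idea (a hypothetical shape, not a proposed or existing kafkajs API), a send() result could report partial failures per partition in a similar way:

```js
// Hypothetical result shape for producer.send(), mirroring the partial-failure
// style of the AWS example above. Field names and values are illustrative only.
const result = {
  failedRecordCount: 20,
  records: [
    { topicName: 'topic-test', partition: 0, baseOffset: '1152' }, // acknowledged
    { topicName: 'topic-test', partition: 1, baseOffset: '1098' }, // acknowledged
    {
      topicName: 'topic-test',
      partition: 2,
      errorCode: 'LEADER_NOT_AVAILABLE',
      errorMessage: 'The broker that led this partition is no longer reachable'
    }
  ]
}
```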
I think if we have any failures that we couldn't handle, we should reject.
If the
Check it out, this issue has come up before in #43. Let's use that instead.
Actually, never mind, this deserves its own issue. Part of it is to communicate the partial failure better to the user, but another part is handling the error to begin with. Realized this one millisecond after closing.
As extended error classes are currently being used in the code, we have to decide whether the correct approach is to resolve with more info or maybe reject with a special error. At the same time, we should use Promise.allSettled() instead of Promise.all(), to be sure that we control which messages were committed; here we need to take into account that Node.js 12.9.0 will be necessary.
There is an allSettled() implementation in /utils.
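For illustration, a minimal sketch of the difference (not the kafkajs code; brokerRequests here stands in for one produce request per broker):

```js
// Requires Node.js >= 12.9.0 for Promise.allSettled.
// brokerRequests is a placeholder: an array of functions, one produce
// request per broker.
async function sendAll(brokerRequests) {
  // Unlike Promise.all, allSettled never short-circuits on the first
  // rejection, so every outcome stays visible and we know exactly which
  // requests (and therefore which partitions) were not acknowledged.
  const results = await Promise.allSettled(brokerRequests.map((send) => send()))
  const failed = results.filter((result) => result.status === 'rejected')

  if (failed.length > 0) {
    // Retry the affected partitions against their new leaders, or reject
    // with an error that lists exactly what was not committed.
    throw new Error(`${failed.length} of ${results.length} produce requests failed`)
  }
  return results.map((result) => result.value)
}
```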
Hi,
Hi, no problem from my side. I have a sudden rush of work and have deferred this task with no forecast date.
Sorry, any movement on this? Sounds like something that might happen on any production system that uses this library.
So basically this issue means that once a node is down (like when MSK clusters go through security patches, which are frequent) we may end up losing messages?
@eladchen Yes.
Any update on this? We are also facing this critical issue. It's been 2 years since it was first reported; it would be great to have a fix.
Describe the bug
I open this issue to clarify whether what I see is expected behaviour that must be handled by the user in some way, or whether it is a bug.
The specific configuration that I have tested:
- A cluster of three brokers running the confluentinc/cp-kafka Docker images, all of them with the same configuration.
- A producer that publishes batches using the send method.
I log every job (the result of a send process), and in normal operation the log output looks as expected.
In the middle of the test I stop one of the servers (not the controller of the cluster) with docker stop 85fsa..., and in most cases, although not always, the log shows that the send to one of the servers failed, but the send method resolved with the response of only 2 servers and the send is not retried. In the last job you can see that the final result is that we have lost 20 messages, without any warning from the library.
The snippet of code that I use to reproduce this behaviour is:
And the docker-compose file:
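As a rough illustration only (not the author's original snippet or compose file; the topic name, broker addresses, and batch size are assumptions), a minimal kafkajs producer loop of this shape exercises the same scenario:

```js
// Minimal reproduction sketch: send small batches in a loop and log every
// result, then stop one broker mid-loop. Values below are illustrative.
const { Kafka } = require('kafkajs')

const kafka = new Kafka({
  clientId: 'repro-producer',
  brokers: ['localhost:9092', 'localhost:9093', 'localhost:9094'],
})

const producer = kafka.producer()

async function run() {
  await producer.connect()
  for (let job = 0; job < 1000; job++) {
    // Each job sends a batch of 20 messages; stopping one broker while this
    // loop runs is what triggers the behaviour described above.
    const messages = Array.from({ length: 20 }, (_, i) => ({
      key: `job-${job}`,
      value: `message-${job}-${i}`,
    }))
    const result = await producer.send({ topic: 'topic-test', messages })
    console.log(`job ${job}`, JSON.stringify(result))
  }
  await producer.disconnect()
}

run().catch(console.error)
```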
Expected behavior
I expect the method to reject with an error so that I can retry the operations myself, or to try to find the server that is now responsible for the partition.
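For example, if send() rejected on this kind of partial failure, the caller could handle it along these lines (a sketch only; the retry count and backoff are assumptions):

```js
// Sketch of the caller-side handling the report is asking for: if send()
// rejected instead of silently resolving, the application could retry or at
// least account for the unacknowledged messages itself.
async function sendWithRetry(producer, payload, attempts = 3) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await producer.send(payload)
    } catch (error) {
      if (attempt === attempts) throw error
      // Give the cluster time to elect a new leader before retrying.
      await new Promise((resolve) => setTimeout(resolve, 1000 * attempt))
    }
  }
}
```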
Observed behavior
Even if one of the servers fails in the middle of sending new records to the cluster, the send method resolves without error and the data is lost.
Environment: KafkaJS v1.16.0.