DO NOT MERGE When a queue bq_init fails, its state is not updated #14120
Conversation
If a queue's bq_init fails, its 'backing_queue_state' is 'undefined', while its state in Mnesia remains 'live' because the update is never performed. The state in Mnesia should instead be set to 'crashed' in the 'terminate' function. Here is an example:

2025-04-19 16:39:24.535 [error] <0.4827.0>
    supervisor: {<0.4827.0>,rabbit_amqqueue_sup}
    errorContext: child_terminated
    reason: {function_clause,
             [{rabbit_queue_index,segment_plus_journal1,
               [undefined,{no_pub,no_del,ack}],
               [{file,"rabbit_queue_index.erl"},{line,1344}]},
              {rabbit_queue_index,'-segment_plus_journal/2-fun-0-',3,
               [{file,"rabbit_queue_index.erl"},{line,1331}]},
              {array,sparse_foldl_3,7,[{file,"array.erl"},{line,1687}]},
              {array,sparse_foldl_2,9,[{file,"array.erl"},{line,1681}]},
              {rabbit_queue_index,segment_entries_foldr,3,
               [{file,"rabbit_queue_index.erl"},{line,1230}]},
              {rabbit_queue_index,read_bounded_segment,5,
               [{file,"rabbit_queue_index.erl"},{line,1211}]},
              {rabbit_queue_index,read,3,
               [{file,"rabbit_queue_index.erl"},{line,500}]},
              {rabbit_variable_queue,maybe_deltas_to_betas,3,
               [{file,"rabbit_variable_queue.erl"},{line,2635}]}]}
    offender: [{pid,<0.519784.0>},
               {id,rabbit_amqqueue},
               {mfargs,
                {rabbit_prequeue,start_link,
                 [{amqqueue,
                   {resource,<<"/">>,queue,
                    <<"versioned_notifications.info">>},
                   false,false,none,[],none,[],[],[],
                   [{vhost,<<"/">>},
                    {name,<<"ttl">>},
                    {pattern,<<".*">>},
                    {'apply-to',<<"queues">>},
                    {definition,[{<<"message-ttl">>,600000}]},
                    {priority,0}],
                   undefined,[],[],live,0,[],<<"/">>,
                   #{user => <<"guest">>,
                     system_creation => 1744865495083934694,
                     recover_on_declare => false,
                     creator =>
                      {1744865646,"[2111:222::144]",54011,
                       "nova-monitor"}},
                   rabbit_classic_queue,#{}},
                  {declare,new},
                  <0.4826.0>]}},
               {restart_type,transient},
               {significant,true},
               {shutdown,600000},
               {child_type,worker}]
2025-04-19 16:39:25.619 [error] <0.534924.0> Restarting crashed queue 'versioned_notifications.info' in vhost '/'.
2025-04-19 16:39:25.633 [error] <0.534924.0>
** Generic server <0.534924.0> terminating
** Last message in was {'$gen_cast', {init,{recovery,{no_barrier,non_clean_shutdown}}}}
** When Server state == {q,{amqqueue,
                            {resource,<<"/">>,queue,
                             <<"versioned_notifications.info">>},
                            false,false,none,[],<0.534924.0>,[],[],[],
                            [{vhost,<<"/">>},
                             {name,<<"ttl">>},
                             {pattern,<<".*">>},
                             {'apply-to',<<"queues">>},
                             {definition,[{<<"message-ttl">>,600000}]},
                             {priority,0}],
                            undefined,[],[],live,0,[],<<"/">>,
                            #{user => <<"guest">>,
                              system_creation => 1744865495083934694,
                              recover_on_declare => false,
                              creator =>
                               {1744865646,"[2111:222::144]",54011,
                                "nova-monitor"}},
                            rabbit_classic_queue,#{}},
                         none,false,undefined,undefined,
                         {state,
                          {queue,[],[],0},
                          {active,-576274256314339,1.0},
                          []},
                         undefined,undefined,undefined,undefined,
                         {state,none,30000,undefined},
                         #{},undefined,undefined,undefined,
                         {state,#{},delegate},
                         undefined,undefined,undefined,undefined,
                         'drop-head',0,0,running,false,0,undefined,<<"/">>,
                         undefined,0,true,false,false,0,undefined,0,0,0,[]}
** Reason for termination ==
** {badarg,[{erlang,binary_to_term,
             [<<131,104,6,119,13,98,97,115,105,99,95,109,101,115,115,
                97,103,101,104,4,119,8,114,101,115,111,117,114,99,101,
                109,0,0,0,1,47,119,8,101,120,99,104,97,110,103,101,109,
                0,0,0,4,110,111,118,97,108,0,0,0,1,109,0,0,0,28,118,
                101,114,115,105,111,110,101,100,95,110,111,116,105,102,
                105,99,97,116,105,111,110,115,46,105,110,102,111,106,
                104,7,119,7,99,111,110,116,101,110,116,97,60,119,4,110,
                111,110,101,109,0,0,0,31,248,0,16,97,112,112,108,105,
                99,97,116,105,111,110,47,106,115,111,110,5,117,116,102,
                45,56,0,0,0,0,2,0,119,25,114,97,98,98,105,116,95,102,
                114,97,109,105,110,103,95,97,109,113,112,95,48,95,57,
                95,49,108,0,0,0,1,109,0,0
2025-04-19 16:39:27.264 [error] <0.4827.0>
    supervisor: {<0.4827.0>,rabbit_amqqueue_sup}
    errorContext: shutdown
    reason: reached_max_restart_intensity
    offender: [{pid,<0.534938.0>},
               {id,rabbit_amqqueue},
               {mfargs,
                {rabbit_prequeue,start_link,
                 [{amqqueue,
                   {resource,<<"/">>,queue,
                    <<"versioned_notifications.info">>},
                   false,false,none,[],none,[],[],[],
                   [{vhost,<<"/">>},
                    {name,<<"ttl">>},
                    {pattern,<<".*">>},
                    {'apply-to',<<"queues">>},
                    {definition,[{<<"message-ttl">>,600000}]},
                    {priority,0}],
                   undefined,[],[],live,0,[],<<"/">>,
                   #{user => <<"guest">>,
                     system_creation => 1744865495083934694,
                     recover_on_declare => false,
                     creator =>
                      {1744865646,"[2111:222::144]",54011,
                       "nova-monitor"}},
                   rabbit_classic_queue,#{}},
                  {declare,new},
                  <0.4826.0>]}},
               {restart_type,transient},
               {significant,true},
               {shutdown,600000},
               {child_type,worker}]
2025-04-19 16:41:24.807 [error] <0.535419.0> Channel error on connection <0.535410.0> ([2111:222::116]:55611 -> [2111:222::13f]:5672, vhost: '/', user: 'guest'), channel 1:
operation queue.declare caused a channel exception not_found: failed to perform operation on queue 'versioned_notifications.info' in vhost '/' due to timeout
A test would be good to have, to confirm that the claim about Mnesia is an issue and that this change fixes it. Or at least a way to reproduce.
The number of test suites that fail is so varied and substantial that I highly doubt that it can be attributed to flakes or the fact that some suites require access to GitHub secrets.
If a queue's bq_init fails, its 'backing_queue_state' is 'undefined', while its state in Mnesia remains 'live' because the update is never performed. The state in Mnesia should instead be set to 'crashed' in the 'terminate' function.
Sorry, I tried several times but couldn't reproduce the issue.
The logs posted earlier came from our production environment. From them, it appears that the queue's queue_index file was corrupted, causing an exception when the queue process tried to parse it.
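For what it's worth, one hypothetical way to simulate that kind of corruption (untested, and not necessarily what happened in production) would be to overwrite part of a classic queue's index segment file while the node is stopped, then let recovery try to parse it:

%% Untested sketch: corrupt a classic queue's index segment on disk.
%% IdxPath is hypothetical; on this node it looked like
%% ".../msg_stores/vhosts/<VHostHash>/queues/<QueueHash>/0.idx".
corrupt_segment(IdxPath) ->
    {ok, Fd} = file:open(IdxPath, [read, write, raw, binary]),
    ok = file:pwrite(Fd, 0, crypto:strong_rand_bytes(64)),
    ok = file:close(Fd).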
Please allow me to explain the issue.
// Log of the first exception occurrence:
2025-04-19 16:39:24.467 [error] <0.519784.0>
** Generic server <0.519784.0> terminating
** Last message in was {drop_expired,1}
** When Server state == {q, {amqqueue,
                             {resource,<<"/">>,queue,
                              <<"versioned_notifications.info">>},
                             false,false,none,[],<0.519784.0>,[],[],[],
                             [{vhost,<<"/">>},
                              {name,<<"ttl">>},
                              {pattern,<<".*">>},
                              {'apply-to',<<"queues">>},
                              {definition,[{<<"message-ttl">>,600000}]},
                              {priority,0}],
                             undefined,[],[],crashed,0,[],<<"/">>,
                             #{user => <<"guest">>,
                               system_creation => 1744865495083934694,
                               recover_on_declare => false,
                               creator =>
                                {1744865646,"[2111:222::144]",54011,
                                 "nova-monitor"}},
                             rabbit_classic_queue,#{}},
                         none,false,rabbit_priority_queue,
                         {passthrough,rabbit_variable_queue,
                          {vqstate,
                           {0,[],[]},
                           {0,[],[]},
                           {delta,1143,208,0,1351},
                           {2,
                            [{msg_status,1142,
                              <<58,34,239,5,3,108,176,153,113,187,248,142,237,
                                44,36,138>>,
                              undefined,false,undefined,rabbit_msg_store,true,
                              msg_store,
                              {message_properties,1745051960184578,false,
                               10691}}],
                            [{msg_status,1141,
                              <<132,111,216,119,180,208,101,89,75,153,23,139,
                                114,215,72,23>>,
                              undefined,false,false,rabbit_msg_store,true,
                              msg_store,
                              {message_properties,1745051960018018,false,
                               10673}}]},
                           {0,[],[]},
                           1351,1141,#{},#{},undefined,rabbit_queue_index,
                           {qistate,
                            "/var/lib/rabbitmq/mnesia/rabbit@zrabbit_2111_222__13f/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/queues/8WFVPRI0JBBNE48GV1MTNZ681",
                            {#{},
                             [{segment,0,
                               "/var/lib/rabbitmq/mnesia/rabbit@zrabbit_2111_222__13f/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/queues/8WFVPRI0JBBNE48GV1MTNZ681/0.idx",
                               {array,2048,0,undefined,
                                {{100,100,100,100,100,100,100,100,100,
                                  {10,10,10,10,10,10,
                                   {undefined,undefined,undefined,
                                    {no_pub,no_del,ack},
                                    {no_pub,no_del,ack},
                                    {no_pub,no_del,ack}
// Log of the 5th exception occurrence:
2025-04-19 16:39:27.243 [error] <0.534938.0> Restarting crashed queue 'versioned_notifications.info' in vhost '/'.
2025-04-19 16:39:27.252 [error] <0.534938.0>
** Generic server <0.534938.0> terminating
** Last message in was {'$gen_cast', {init,{recovery,{no_barrier,non_clean_shutdown}}}}
** When Server state == {q,{amqqueue,
                            {resource,<<"/">>,queue,
                             <<"versioned_notifications.info">>},
                            false,false,none,[],<0.534938.0>,[],[],[],
                            [{vhost,<<"/">>},
                             {name,<<"ttl">>},
                             {pattern,<<".*">>},
                             {'apply-to',<<"queues">>},
                             {definition,[{<<"message-ttl">>,600000}]},
                             {priority,0}],
                            undefined,[],[],live,0,[],<<"/">>,
                            #{user => <<"guest">>,
                              system_creation => 1744865495083934694,
                              recover_on_declare => false,
                              creator =>
                               {1744865646,"[2111:222::144]",54011,
                                "nova-monitor"}},
                            rabbit_classic_queue,#{}},
                         none,false,undefined,undefined,
                         {state,
                          {queue,[],[],0},
                          {active,-576274254690241,1.0},
                          []},
                         undefined,undefined,undefined,undefined,
                         {state,none,30000,undefined},
                         #{},undefined,undefined,undefined,
                         {state,#{},delegate},
                         undefined,undefined,undefined,undefined,
                         'drop-head',0,0,running,false,0,undefined,<<"/">>,
                         undefined,0,true,false,false,0,undefined,0,0,0,[]}
** Reason for termination ==
** {badarg,[{erlang,binary_to_term,
             [<<131,104,6,119,13,98,97,115,105,99,95,109,101,115,115,
                97,103,101,104,4,119,8,114,101,115,111,117,114,99,101,
                109,0,0,0,1,47,119,8,101,120,99,104,97,110,103,101,109,
                0,0,0,4,110,111,118,97,108,0,0,0,1,109,0,0,0,28,118,
                101,114,115,105,111,110,101,100,95,110,111,116,105,102,
                105,99,97,116,105,111,110,115,46,105,110,102,111,106,
                104,7,119,7,99,111,110,116,101,110,116,97,60,119,4,110,
                111,110,101,109,0,0,0,31,248,0,16,97,112,112,108,105,
                99,97,116,105,111,110,47,106,115,111,110,5,117,116,102,
                45,56,0,0,0,0,2,0,119,25,114,97,98,98,105,116,95,102,
                114,97,109,105,110,103,95,97,109,113,112,95,48,95,57,
                95,49,108,0,0,0,1,109,0,0,6,157,123,34,111,115,108,111,
                46,118,101,114,115,105,111,110
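The termination reason above is binary_to_term/1 failing on a truncated external-term binary. A quick, illustrative check in any Erlang shell reproduces the same badarg (the tuple below is a stand-in, not the actual on-disk record):

%% Illustrative only: a cut-short external-term binary makes
%% binary_to_term/1 raise badarg, matching the termination reason above.
Bin       = term_to_binary({basic_message, some_payload}),
Truncated = binary:part(Bin, 0, byte_size(Bin) - 4),
binary_to_term(Truncated).   %% => ** exception error: bad argument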
// Five crashes within 3 seconds: the supervisor reaches max restart intensity and makes no further recovery attempts.
2025-04-19 16:39:27.264 [error] <0.4827.0>
    supervisor: {<0.4827.0>,rabbit_amqqueue_sup}
    errorContext: shutdown
    reason: reached_max_restart_intensity
    offender: [{pid,<0.534938.0>},
               {id,rabbit_amqqueue},
               {mfargs,
                {rabbit_prequeue,start_link,
                 [{amqqueue,
                   {resource,<<"/">>,queue,
                    <<"versioned_notifications.info">>},
                   false,false,none,[],none,[],[],[],
                   [{vhost,<<"/">>},
                    {name,<<"ttl">>},
                    {pattern,<<".*">>},
                    {'apply-to',<<"queues">>},
                    {definition,[{<<"message-ttl">>,600000}]},
                    {priority,0}],
                   undefined,[],[],live,0,[],<<"/">>,
                   #{user => <<"guest">>,
                     system_creation => 1744865495083934694,
                     recover_on_declare => false,
                     creator =>
                      {1744865646,"[2111:222::144]",54011,
                       "nova-monitor"}},
                   rabbit_classic_queue,#{}},
                  {declare,new},
                  <0.4826.0>]}},
               {restart_type,transient},
               {significant,true},
               {shutdown,600000},
               {child_type,worker}]
// Because the queue's state in Mnesia remains 'live', subsequent queue.declare operations on this queue will always time out.
2025-04-19 16:41:24.807 [error] <0.535419.0> Channel error on connection <0.535410.0> ([2111:222::116]:55611 -> [2111:222::13f]:5672, vhost: '/', user: 'guest'), channel 1:
operation queue.declare caused a channel exception not_found: failed to perform operation on queue 'versioned_notifications.info' in vhost '/' due to timeout
2025-04-19 16:41:48.978 [error] <0.535604.0> Channel error on connection <0.535594.0> ([2111:222::144]:59537 -> [2111:222::13f]:5672, vhost: '/', user: 'guest'), channel 1:
operation queue.declare caused a channel exception not_found: failed to perform operation on queue 'versioned_notifications.info' in vhost '/' due to timeout
// According to the previous log, the process <0.534938.0> has exited abnormally.
2025-04-19 16:39:27.243 [error] <0.534938.0> Restarting crashed queue 'versioned_notifications.info' in vhost '/'.
2025-04-19 16:39:27.252 [error] <0.534938.0>
** Generic server <0.534938.0> terminating
** Last message in was {'$gen_cast', {init,{recovery,{no_barrier,non_clean_shutdown}}}}
// However, the data in the Mnesia table shows that the queue's state is still 'live':
[{amqqueue,{resource,<<"/">>,queue,<<"versioned_notifications.info">>},
false,false,none,[],<13588.534938.0>,[],[],[],
[{vhost,<<"/">>},
{name,<<"ttl">>},
{pattern,<<".*">>},
{'apply-to',<<"queues">>},
{definition,[{<<"message-ttl">>,600000}]},
{priority,0}],
undefined,[],[],live,0,[],<<"/">>,
#{user => <<"guest">>},
rabbit_classic_queue,#{}}]
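For reference, a record like the one above can also be inspected from a remote shell. A minimal sketch, assuming the classic Mnesia-backed metadata store and the amqqueue accessor module:

%% Assumes a Mnesia-backed node; 'rabbit_queue' is the table holding
%% live queue records keyed by the #resource{} name.
Name = rabbit_misc:r(<<"/">>, queue, <<"versioned_notifications.info">>),
{ok, Q} = rabbit_amqqueue:lookup(Name),
amqqueue:get_state(Q).   %% => live, although the queue process is gone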
// In theory, the queue's state should be updated to 'crashed' in the process's terminate function.
Here is the code in rabbit_amqqueue_process.erl that performs that update:
%% If we crashed don't try to clean up the BQS, probably best to leave it.
terminate(_Reason, State = #q{q = Q}) ->
    terminate_shutdown(fun (BQS) ->
                               Q2 = amqqueue:set_state(Q, crashed),
                               rabbit_amqqueue:store_queue(Q2),
                               BQS
                       end, State).
However, because queue initialization failed while parsing the message store file, the queue's 'backing_queue_state' (BQS) is 'undefined'. In that case terminate_shutdown never invokes the fun above, so the queue's state in Mnesia remains 'live'.
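To make the failure mode concrete, this is roughly the shape of terminate_shutdown/2 that produces the behaviour (a simplified sketch, not the verbatim source):

%% Simplified sketch: when bq_init never completed, BQS is 'undefined'
%% and Fun, the callback that would persist the 'crashed' state, is
%% skipped entirely.
terminate_shutdown(Fun, State = #q{backing_queue_state = BQS}) ->
    case BQS of
        undefined -> State;                          %% Fun is never called
        _         -> State#q{backing_queue_state = Fun(BQS)}
    end.

A fix in the direction this PR argues for would move the state update out of the callback so that it runs even when BQS is 'undefined', for example (a hypothetical shape, not necessarily the actual diff in this PR):

terminate(_Reason, State = #q{q = Q}) ->
    Q2 = amqqueue:set_state(Q, crashed),
    %% Persist 'crashed' unconditionally, even when BQS =:= undefined.
    rabbit_amqqueue:store_queue(Q2),
    terminate_shutdown(fun (BQS) -> BQS end, State).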