Skip to content

DO NOT MERGE When a queue bq_init fails, its state is not updated #14120

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

tomyouyou
Copy link
Contributor

@tomyouyou tomyouyou commented Jun 25, 2025

If a queue bq_init fails, its 'backing_queue_state' is 'undefined', while its state in mnesia remains 'live' due to the inability to update. In fact, its state in mnesia should be updated to 'crashed' in the 'terminate' function.

Sorry, I tried several times but couldn't reproduce the issue.
The logs posted earlier occurred in our production environment. From the logs, it appears that the queue_index file of that queue was corrupted, causing an exception when qpid tried to parse it.

Please allow me to explain the issue.

// Log of the first exception occurrence:

2025-04-19 16:39:24.467 [error] <0.519784.0> ** Generic server <0.519784.0> terminating, ** Last message in was {drop_expired,1}, ** When Server state == {q, {amqqueue, {resource,<<"/">>,queue, <<"versioned_notifications.info">>}, false,false,none,[],<0.519784.0>,[],[],[], [{vhost,<<"/">>}, {name,<<"ttl">>}, {pattern,<<".*">>}, {'apply-to',<<"queues">>}, {definition,[{<<"message-ttl">>,600000}]}, {priority,0}], undefined,[],[],crashed,0,[],<<"/">>, #{user => <<"guest">>, system_creation => 1744865495083934694, recover_on_declare => false, creator =>, {1744865646,"[2111:222::144]",54011, "nova-monitor"}}, rabbit_classic_queue,#{}}, none,false,rabbit_priority_queue, {passthrough,rabbit_variable_queue, {vqstate, {0,[],[]}, {0,[],[]}, {delta,1143,208,0,1351}, {2, [{msg_status,1142, <<58,34,239,5,3,108,176,153,113,187,248,142,237, 44,36,138>>, undefined,false,undefined,rabbit_msg_store,true, msg_store, {message_properties,1745051960184578,false, 10691}}], [{msg_status,1141, <<132,111,216,119,180,208,101,89,75,153,23,139, 114,215,72,23>>, undefined,false,false,rabbit_msg_store,true, msg_store, {message_properties,1745051960018018,false, 10673}}]}, {0,[],[]}, 1351,1141,#{},#{},undefined,rabbit_queue_index, {qistate, "/var/lib/rabbitmq/mnesia/rabbit@zrabbit_2111_222__13f/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/queues/8WFVPRI0JBBNE48GV1MTNZ681", {#{}, [{segment,0, "/var/lib/rabbitmq/mnesia/rabbit@zrabbit_2111_222__13f/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/queues/8WFVPRI0JBBNE48GV1MTNZ681/0.idx", {array,2048,0,undefined, {{100,100,100,100,100,100,100,100,100, {10,10,10,10,10,10, {undefined,undefined,undefined, {no_pub,no_del,ack}, {no_pub,no_del,ack}, {no_pub,no_del,ack}

// Log of the 5th exception occurrence:

2025-04-19 16:39:27.243 [error] <0.534938.0> Restarting crashed queue 'versioned_notifications.info' in vhost '/'.
2025-04-19 16:39:27.252 [error] <0.534938.0> ** Generic server <0.534938.0> terminating, ** Last message in was {'$gen_cast', {init,{recovery,{no_barrier,non_clean_shutdown}}}}, ** When Server state == {q,{amqqueue, {resource,<<"/">>,queue, <<"versioned_notifications.info">>}, false,false,none,[],<0.534938.0>,[],[],[], [{vhost,<<"/">>}, {name,<<"ttl">>}, {pattern,<<".*">>}, {'apply-to',<<"queues">>}, {definition,[{<<"message-ttl">>,600000}]}, {priority,0}], undefined,[],[],live,0,[],<<"/">>, #{user => <<"guest">>, system_creation => 1744865495083934694, recover_on_declare => false, creator =>, {1744865646,"[2111:222::144]",54011, "nova-monitor"}}, rabbit_classic_queue,#{}}, none,false,undefined,undefined, {state, {queue,[],[],0}, {active,-576274254690241,1.0}, []}, undefined,undefined,undefined,undefined, {state,none,30000,undefined}, #{},undefined,undefined,undefined, {state,#{},delegate}, undefined,undefined,undefined,undefined, 'drop-head',0,0,running,false,0,undefined,<<"/">>, undefined,0,true,false,false,0,undefined,0,0,0,[]}, ** Reason for termination == , ** {badarg,[{erlang,binary_to_term, [<<131,104,6,119,13,98,97,115,105,99,95,109,101,115,115, 97,103,101,104,4,119,8,114,101,115,111,117,114,99,101, 109,0,0,0,1,47,119,8,101,120,99,104,97,110,103,101,109, 0,0,0,4,110,111,118,97,108,0,0,0,1,109,0,0,0,28,118, 101,114,115,105,111,110,101,100,95,110,111,116,105,102, 105,99,97,116,105,111,110,115,46,105,110,102,111,106, 104,7,119,7,99,111,110,116,101,110,116,97,60,119,4,110, 111,110,101,109,0,0,0,31,248,0,16,97,112,112,108,105, 99,97,116,105,111,110,47,106,115,111,110,5,117,116,102, 45,56,0,0,0,0,2,0,119,25,114,97,98,98,105,116,95,102, 114,97,109,105,110,103,95,97,109,113,112,95,48,95,57, 95,49,108,0,0,0,1,109,0,0,6,157,123,34,111,115,108,111, 46,118,101,114,115,105,111,110

// 5 consecutive abnormalities within 3 seconds, no more attempts to recover.

2025-04-19 16:39:27.264 [error] <0.4827.0> supervisor: {<0.4827.0>,rabbit_amqqueue_sup}, errorContext: shutdown, reason: reached_max_restart_intensity, offender: [{pid,<0.534938.0>}, {id,rabbit_amqqueue}, {mfargs, {rabbit_prequeue,start_link, [{amqqueue, {resource,<<"/">>,queue, <<"versioned_notifications.info">>}, false,false,none,[],none,[],[],[], [{vhost,<<"/">>}, {name,<<"ttl">>}, {pattern,<<".*">>}, {'apply-to',<<"queues">>}, {definition,[{<<"message-ttl">>,600000}]}, {priority,0}], undefined,[],[],live,0,[],<<"/">>, #{user => <<"guest">>, system_creation => 1744865495083934694, recover_on_declare => false, creator =>, {1744865646,"[2111:222::144]",54011, "nova-monitor"}}, rabbit_classic_queue,#{}}, {declare,new}, <0.4826.0>]}}, {restart_type,transient}, {significant,true}, {shutdown,600000}, {child_type,worker}],

// Since the status of the queue in mnesia remains live, subsequent queue.declare operations on this queue will always timeout.

2025-04-19 16:41:24.807 [error] <0.535419.0> Channel error on connection <0.535410.0> ([2111:222::116]:55611 -> [2111:222::13f]:5672, vhost: '/', user: 'guest'), channel 1:, operation queue.declare caused a channel exception not_found: failed to perform operation on queue 'versioned_notifications.info' in vhost '/' due to timeout
2025-04-19 16:41:48.978 [error] <0.535604.0> Channel error on connection <0.535594.0> ([2111:222::144]:59537 -> [2111:222::13f]:5672, vhost: '/', user: 'guest'), channel 1:, operation queue.declare caused a channel exception not_found: failed to perform operation on queue 'versioned_notifications.info' in vhost '/' due to timeout

// According to the previous log, the process <0.534938.0> has exited abnormally.
2025-04-19 16:39:27.243 [error] <0.534938.0> Restarting crashed queue 'versioned_notifications.info' in vhost '/'.
2025-04-19 16:39:27.252 [error] <0.534938.0> ** Generic server <0.534938.0> terminating, ** Last message in was {'$gen_cast', {init,{recovery,{no_barrier,non_clean_shutdown}}}},

//However, the data in the mnesia table shows that the queue status is still live.

[{amqqueue,{resource,<<"/">>,queue,<<"versioned_notifications.info">>},
false,false,none,[],<13588.534938.0>,[],[],[],
[{vhost,<<"/">>},
{name,<<"ttl">>},
{pattern,<<".*">>},
{'apply-to',<<"queues">>},
{definition,[{<<"message-ttl">>,600000}]},
{priority,0}],
undefined,[],[],live,0,[],<<"/">>,
#{user => <<"guest">>},
rabbit_classic_queue,#{}}]

//Theoretically, the queue state should be updated to 'crashed' in the process terminate function.
Here is the code for updating the state in the rabbit_amqqueue_process.erl file:

%% If we crashed don't try to clean up the BQS, probably best to leave it.
terminate(_Reason, State = #q{q = Q}) ->
terminate_shutdown(fun (BQS) ->
Q2 = amqqueue:set_state(Q, crashed),

However, Due to the failure of queue initialization in parsing message store file, its BQS 'backing_queue_state' is undefined.
The terminate_shutdown will not call the function defined here, thus the state of the queue in mnesia remains as live.

If a queue bq_init fails, its 'backing_queue_state' is 'undefined', while its state in mnesia remains 'live' due to the inability to update.
In fact, its state in mnesia should be updated to 'crashed' in the 'terminate' function.

Here is an example:

2025-04-19 16:39:24.535 [error] <0.4827.0>     supervisor: {<0.4827.0>,rabbit_amqqueue_sup}, errorContext: child_terminated, reason: {function_clause, [{rabbit_queue_index,segment_plus_journal1, [undefined,{no_pub,no_del,ack}], [{file,"rabbit_queue_index.erl"},{line,1344}]}, {rabbit_queue_index,'-segment_plus_journal/2-fun-0-',3, [{file,"rabbit_queue_index.erl"},{line,1331}]}, {array,sparse_foldl_3,7,[{file,"array.erl"},{line,1687}]}, {array,sparse_foldl_2,9,[{file,"array.erl"},{line,1681}]}, {rabbit_queue_index,segment_entries_foldr,3, [{file,"rabbit_queue_index.erl"},{line,1230}]}, {rabbit_queue_index,read_bounded_segment,5, [{file,"rabbit_queue_index.erl"},{line,1211}]}, {rabbit_queue_index,read,3, [{file,"rabbit_queue_index.erl"},{line,500}]}, {rabbit_variable_queue,maybe_deltas_to_betas,3, [{file,"rabbit_variable_queue.erl"},{line,2635}]}]}, offender: [{pid,<0.519784.0>}, {id,rabbit_amqqueue}, {mfargs, {rabbit_prequeue,start_link, [{amqqueue, {resource,<<"/">>,queue, <<"versioned_notifications.info">>}, false,false,none,[],none,[],[],[], [{vhost,<<"/">>}, {name,<<"ttl">>}, {pattern,<<".*">>}, {'apply-to',<<"queues">>}, {definition,[{<<"message-ttl">>,600000}]}, {priority,0}], undefined,[],[],live,0,[],<<"/">>, #{user => <<"guest">>, system_creation => 1744865495083934694, recover_on_declare => false, creator =>, {1744865646,"[2111:222::144]",54011, "nova-monitor"}}, rabbit_classic_queue,#{}}, {declare,new}, <0.4826.0>]}}, {restart_type,transient}, {significant,true}, {shutdown,600000}, {child_type,worker}],
2025-04-19 16:39:25.619 [error] <0.534924.0> Restarting crashed queue 'versioned_notifications.info' in vhost '/'.
2025-04-19 16:39:25.633 [error] <0.534924.0> ** Generic server <0.534924.0> terminating, ** Last message in was {'$gen_cast', {init,{recovery,{no_barrier,non_clean_shutdown}}}}, ** When Server state == {q,{amqqueue, {resource,<<"/">>,queue, <<"versioned_notifications.info">>}, false,false,none,[],<0.534924.0>,[],[],[], [{vhost,<<"/">>}, {name,<<"ttl">>}, {pattern,<<".*">>}, {'apply-to',<<"queues">>}, {definition,[{<<"message-ttl">>,600000}]}, {priority,0}], undefined,[],[],live,0,[],<<"/">>, #{user => <<"guest">>, system_creation => 1744865495083934694, recover_on_declare => false, creator =>, {1744865646,"[2111:222::144]",54011, "nova-monitor"}}, rabbit_classic_queue,#{}}, none,false,undefined,undefined, {state, {queue,[],[],0}, {active,-576274256314339,1.0}, []}, undefined,undefined,undefined,undefined, {state,none,30000,undefined}, #{},undefined,undefined,undefined, {state,#{},delegate}, undefined,undefined,undefined,undefined, 'drop-head',0,0,running,false,0,undefined,<<"/">>, undefined,0,true,false,false,0,undefined,0,0,0,[]}, ** Reason for termination == , ** {badarg,[{erlang,binary_to_term, [<<131,104,6,119,13,98,97,115,105,99,95,109,101,115,115, 97,103,101,104,4,119,8,114,101,115,111,117,114,99,101, 109,0,0,0,1,47,119,8,101,120,99,104,97,110,103,101,109, 0,0,0,4,110,111,118,97,108,0,0,0,1,109,0,0,0,28,118, 101,114,115,105,111,110,101,100,95,110,111,116,105,102, 105,99,97,116,105,111,110,115,46,105,110,102,111,106, 104,7,119,7,99,111,110,116,101,110,116,97,60,119,4,110, 111,110,101,109,0,0,0,31,248,0,16,97,112,112,108,105, 99,97,116,105,111,110,47,106,115,111,110,5,117,116,102, 45,56,0,0,0,0,2,0,119,25,114,97,98,98,105,116,95,102, 114,97,109,105,110,103,95,97,109,113,112,95,48,95,57, 95,49,108,0,0,0,1,109,0,0


2025-04-19 16:39:27.264 [error] <0.4827.0>     supervisor: {<0.4827.0>,rabbit_amqqueue_sup}, errorContext: shutdown, reason: reached_max_restart_intensity, offender: [{pid,<0.534938.0>}, {id,rabbit_amqqueue}, {mfargs, {rabbit_prequeue,start_link, [{amqqueue, {resource,<<"/">>,queue, <<"versioned_notifications.info">>}, false,false,none,[],none,[],[],[], [{vhost,<<"/">>}, {name,<<"ttl">>}, {pattern,<<".*">>}, {'apply-to',<<"queues">>}, {definition,[{<<"message-ttl">>,600000}]}, {priority,0}], undefined,[],[],live,0,[],<<"/">>, #{user => <<"guest">>, system_creation => 1744865495083934694, recover_on_declare => false, creator =>, {1744865646,"[2111:222::144]",54011, "nova-monitor"}}, rabbit_classic_queue,#{}}, {declare,new}, <0.4826.0>]}}, {restart_type,transient}, {significant,true}, {shutdown,600000}, {child_type,worker}],


2025-04-19 16:41:24.807 [error] <0.535419.0> Channel error on connection <0.535410.0> ([2111:222::116]:55611 -> [2111:222::13f]:5672, vhost: '/', user: 'guest'), channel 1:, operation queue.declare caused a channel exception not_found: failed to perform operation on queue 'versioned_notifications.info' in vhost '/' due to timeout
@lhoguin
Copy link
Contributor

lhoguin commented Jun 25, 2025

A test would be good to have to confirm that the claim about Mnesia is an issue, and that this fixes it. Or at least a way to reproduce.

Copy link
Collaborator

@michaelklishin michaelklishin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The number of test suites that fail is so varied and substantial that I highly doubt that it can be attributed to flakes or the fact that some suites require access to GitHub secrets.

@michaelklishin michaelklishin changed the title When a queue bq_init fails, its state is not updated. DO NOT MERGE When a queue bq_init fails, its state is not updated Jun 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants