Bug: Collator does not resume block production after being restarted while benched by collator rotation #693

@Luca-Poggi

Description

Collators may stop producing blocks indefinitely if they are restarted during a session in which they have been benched by the new collator rotation mechanism.

With the new collator rotation feature, one collator is temporarily removed from the active collator set during odd sessions. I observed that if the benched collator node is restarted while it is not part of the active session set, it does not resume block production in the following session, even though it should become active again.
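To make the rotation behaviour concrete, here is a minimal sketch of how the active set might vary from session to session. The function name `active_set` and the rule for choosing which collator is benched (cycling through the configured set on each odd session) are assumptions made for illustration. The issue only states that one collator is benched during odd sessions, so the actual selection logic in the pallet may differ.

```rust
/// Hypothetical rotation rule, for illustration only.
/// Even sessions: every configured collator is active.
/// Odd sessions: exactly one collator is benched.
fn active_set<'a>(configured: &[&'a str], session_index: u32) -> Vec<&'a str> {
    // Nothing to bench on even sessions, or when there is at most one collator.
    if session_index % 2 == 0 || configured.len() < 2 {
        return configured.to_vec();
    }
    // ASSUMPTION: the benched slot cycles through the set on each odd session.
    // The real pallet may pick the benched collator differently.
    let benched = ((session_index / 2) as usize) % configured.len();
    configured
        .iter()
        .enumerate()
        .filter(|(i, _)| *i != benched)
        .map(|(_, c)| *c)
        .collect()
}
```

Under this assumed rule, a collator that is missing from the set in one odd session is back in the set in the next session, which is the transition the restarted node fails to pick up.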

The collator remains synced and running, but it no longer authors blocks. The problem persists indefinitely across subsequent sessions. Restarting the same collator again during a session in which it is expected to be active fixes the issue, and block production resumes.

This looks like a bug in how the collator node handles session/authority changes when it starts while it is currently benched.

Expected Behavior

A collator restarted during a session in which it is benched should automatically resume block production once it becomes part of the active collator set again in the next applicable session.

The node should correctly detect the session transition and start authoring blocks again without requiring an additional manual restart.

Actual Behavior

If a collator is restarted while it is benched, it does not resume block production when it becomes eligible again.

Instead, it continues running but does not author blocks, even across later sessions where it should be active. The issue appears to persist indefinitely until the collator is manually restarted again during a session in which it is expected to produce blocks.

Possible Fix

The issue may be related to the node not properly updating or reinitializing its authoring role after a session transition when it was started while not included in the active collator set.

A possible area to investigate is the interaction between the new pallet-collator-rotation session manager wrapper and the collator authoring/session key logic on the node side.

The node should probably re-check whether it is part of the active authority/collator set at every session transition and enable block production accordingly, even if it was started during a session where it was temporarily benched.
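The re-check suggested above could look roughly like the following sketch. The `Collator` struct, its `authoring_enabled` flag, and the `on_new_session` hook are hypothetical names invented for this example. They are not the actual Basilisk or Cumulus types, and the real node may decide whether to author in a different place.

```rust
use std::collections::HashSet;

/// Hypothetical node-side state; not the actual Basilisk or Cumulus types.
struct Collator {
    id: String,
    authoring_enabled: bool,
}

impl Collator {
    /// Startup decision, based only on the set that is active right now.
    /// If this were the only place membership is checked, a node started
    /// while benched would stay disabled, matching the reported symptom.
    fn start(id: &str, active_now: &HashSet<String>) -> Self {
        Collator {
            id: id.to_string(),
            authoring_enabled: active_now.contains(id),
        }
    }

    /// Proposed behaviour: re-evaluate membership on every session
    /// transition instead of trusting the startup decision.
    fn on_new_session(&mut self, active: &HashSet<String>) {
        self.authoring_enabled = active.contains(&self.id);
    }
}

/// Small helper to build a set of collator ids for the example.
fn set(ids: &[&str]) -> HashSet<String> {
    ids.iter().map(|s| s.to_string()).collect()
}
```

In this sketch, a node started with `Collator::start("alice", &set(&["bob", "charlie"]))` begins with authoring disabled, and a later call to `on_new_session` with a set that contains `alice` enables it again without a restart.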

Steps to Reproduce

  1. Run a Basilisk collator that is part of the configured collator set.
  2. Wait for a session in which this collator is benched by the collator rotation mechanism.
  3. Restart the collator node during that benched session.
  4. Wait until the following session, where the collator should become active again.
  5. Observe that the collator does not resume block production.
  6. Wait for additional sessions and observe that the collator still does not produce blocks.
  7. Restart the collator again during a session in which it is expected to be active.
  8. Observe that block production resumes after the restart.

Context

This issue affects collator reliability after the introduction of the collator rotation feature.

A routine restart, upgrade, crash recovery, or infrastructure maintenance operation may permanently stop a collator from producing blocks if it happens during the session where the collator is benched.

This is particularly problematic because the node appears to remain online and synced, but silently stops authoring blocks until another manual restart is performed at the right time.
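Because the failure is silent, operators may want an external check that flags it. The sketch below assumes per-session data (the active set and the number of blocks authored by each collator) has already been collected by some other means. The `SessionReport` type and the `report` and `silent_sessions` helpers are hypothetical and do not correspond to any existing Basilisk tooling.

```rust
use std::collections::{HashMap, HashSet};

/// One observed session: who was active and how many blocks each authored.
/// ASSUMPTION: this data is gathered elsewhere, e.g. from chain state.
struct SessionReport {
    index: u32,
    active: HashSet<String>,
    blocks_authored: HashMap<String, u32>,
}

/// Convenience constructor for the example.
fn report(index: u32, active: &[&str], authored: &[(&str, u32)]) -> SessionReport {
    SessionReport {
        index,
        active: active.iter().map(|s| s.to_string()).collect(),
        blocks_authored: authored.iter().map(|(c, n)| (c.to_string(), *n)).collect(),
    }
}

/// Returns the indices of sessions in which `collator` was in the active
/// set but authored no blocks. Sessions where it was benched are ignored.
fn silent_sessions(collator: &str, reports: &[SessionReport]) -> Vec<u32> {
    reports
        .iter()
        .filter(|r| r.active.contains(collator))
        .filter(|r| r.blocks_authored.get(collator).copied().unwrap_or(0) == 0)
        .map(|r| r.index)
        .collect()
}
```

A non-empty result for a collator that is online and synced would match the symptom described in this issue. Note that a collator missing a few slots for unrelated reasons would not be flagged, since only sessions with zero authored blocks are reported.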

Your Environment

  • Version used: Basilisk runtime including the new collator rotation feature introduced in PR #690 (feat: collator rotation)
  • System type: Collator node
