NAS-135033 / 24.10.2.1 / zed: Ensure spare activation after kernel-initiated device removal #290

Merged
merged 2 commits into from
Mar 28, 2025

@ixhamza ixhamza commented Mar 28, 2025

Motivation and Context

zed fails to activate a hot spare when a device is removed by the kernel rather than through a hotplug event.

Description

In addition to hotplug events, the kernel may also mark a failing vdev as REMOVED. This was observed in one of our customer reports and reproduced by forcing the NVMe host driver to disable the device after a failed reset caused by a command timeout. In such cases, the spare was not activated because the device had already transitioned to the REMOVED state before zed processed the event.
To address this, explicitly attempt hot spare activation when the kernel marks a device as REMOVED.
This patch also backports openzfs#16751 to ensure a clean merge.
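
The decision described above can be pictured with a small, self-contained model. This is only a sketch under stated assumptions: every type and function name below (`sketch_vdev_state_t`, `sketch_removal_event`, `sketch_try_spare`, the `with_fix` flag) is hypothetical and does not mirror zed's actual code or the OpenZFS API. It also assumes, for simplicity, that hotplug removals always lead to a spare activation attempt, and it leaves OFFLINE handling out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified vdev states; not the real OpenZFS values. */
typedef enum {
	SKETCH_VDEV_HEALTHY,
	SKETCH_VDEV_FAULTED,
	SKETCH_VDEV_REMOVED
} sketch_vdev_state_t;

/* Hypothetical origin of a removal notification. */
typedef enum {
	SKETCH_SRC_HOTPLUG,	/* device disappeared via a hotplug event */
	SKETCH_SRC_KERNEL	/* kernel itself marked the vdev REMOVED */
} sketch_source_t;

typedef struct {
	sketch_source_t source;
	/* vdev state at the moment the daemon processes the event */
	sketch_vdev_state_t state;
} sketch_removal_event;

/*
 * Returns true if the modeled daemon should attempt hot spare
 * activation. with_fix == false models the behaviour described as
 * the bug; with_fix == true models the behaviour after the patch.
 */
bool
sketch_try_spare(const sketch_removal_event *ev, bool with_fix)
{
	assert(ev != NULL);

	/* Assumption: the hotplug path already attempted activation. */
	if (ev->source == SKETCH_SRC_HOTPLUG)
		return (true);

	/*
	 * Kernel-initiated removal: the vdev is already REMOVED when the
	 * event is processed. Without the fix no attempt is made; with
	 * the fix activation is attempted explicitly.
	 */
	if (ev->state == SKETCH_VDEV_REMOVED)
		return (with_fix);

	return (false);
}
```

In this model, a kernel-originated event for a vdev that is already REMOVED yields no activation attempt when `with_fix` is false and one attempt when it is true, which is the change in behaviour the description asks for.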

How Has This Been Tested?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

ixhamza added 2 commits March 28, 2025 03:17
When an OFFLINE device is physically removed, a spare is automatically
activated. However, this behavior differs in FreeBSD, where we do not
transition from OFFLINE state to REMOVED.
Our support team has encountered cases where customers experienced
unexpected behavior during drive replacements, with multiple spares
activating for the same VDEV due to a single disk replacement. This
patch ensures that a drive in an OFFLINE state remains in that state,
preventing it from transitioning to REMOVED and being automatically
replaced by a spare.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Ameer Hamza <[email protected]>
Closes openzfs#16751
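
The rule stated in the commit message above (an OFFLINE drive stays OFFLINE when it is physically pulled, so no spare is activated for it) can be sketched as follows. This is a minimal illustration, not the backported change itself: `model_state_t`, `model_vdev`, and `model_on_physical_removal` are hypothetical names, and the spare counter merely stands in for whatever zed does to attach a spare.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified vdev states; not the real OpenZFS values. */
typedef enum {
	MODEL_HEALTHY,
	MODEL_OFFLINE,
	MODEL_REMOVED
} model_state_t;

typedef struct {
	model_state_t state;
	int spares_activated;	/* stand-in for attaching a hot spare */
} model_vdev;

/* Model what happens when the disk behind a vdev is physically removed. */
void
model_on_physical_removal(model_vdev *vd)
{
	assert(vd != NULL);

	/*
	 * An OFFLINE drive stays OFFLINE: no transition to REMOVED and
	 * therefore no automatic spare activation.
	 */
	if (vd->state == MODEL_OFFLINE)
		return;

	/* Any other drive is marked REMOVED and a spare takes over. */
	vd->state = MODEL_REMOVED;
	vd->spares_activated++;
}
```

Pulling a drive that is OFFLINE leaves both its state and the spare count unchanged in this model, while pulling a healthy drive moves it to REMOVED and activates exactly one spare.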

In addition to hotplug events, the kernel may also mark a failing vdev
as REMOVED. This was observed in a customer report and reproduced by
forcing the NVMe host driver to disable the device after a failed reset
due to command timeout. In such cases, the spare was not activated
because the device had already transitioned to a REMOVED state before
zed processed the event.
To address this, explicitly attempt hot spare activation when the
kernel marks a device as REMOVED.

Signed-off-by: Ameer Hamza <[email protected]>
@ixhamza ixhamza requested a review from amotin March 28, 2025 18:12
@bugclerk bugclerk changed the title zed: Ensure spare activation after kernel-initiated device removal NAS-135033 / None / zed: Ensure spare activation after kernel-initiated device removal Mar 28, 2025
@ixhamza ixhamza changed the title NAS-135033 / None / zed: Ensure spare activation after kernel-initiated device removal NAS-135033 / 24.10.2.1 / zed: Ensure spare activation after kernel-initiated device removal Mar 28, 2025
@amotin amotin merged commit d8c2ec4 into release/24.10.2.1 Mar 28, 2025
20 of 23 checks passed
@amotin amotin deleted the NAS-135033-24.10.2.1 branch March 28, 2025 19:52
@bugclerk
Not updating JIRA ticket https://ixsystems.atlassian.net/browse/NAS-135033 target versions as no JIRA version corresponds to this PR

@bugclerk
This PR has been merged and conversations have been locked.
If you would like to discuss more about this issue please use our forums or raise a Jira ticket.

@truenas truenas locked as resolved and limited conversation to collaborators Mar 28, 2025
3 participants