NAS-135033 / 25.04.1 / zed: Ensure spare activation after kernel-initiated device removal #289

Merged 1 commit on Mar 28, 2025


@ixhamza ixhamza commented Mar 28, 2025

Motivation and Context

zed fails to activate a hot spare when a device is removed by the kernel rather than by a hotplug event.

Description

In addition to hotplug events, the kernel may also mark a failing vdev as REMOVED. This was observed in one of our customer reports and reproduced by forcing the NVMe host driver to disable the device after a reset failed due to a command timeout. In such cases, the spare was not activated because the device had already transitioned to the REMOVED state before zed processed the event.
To address this, zed now explicitly attempts hot spare activation when the kernel marks a device as REMOVED.
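
The behavior change described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not the code from this PR: every type and function name below (`sketch_vdev_t`, `sketch_pool_t`, `sketch_on_removed_event`, and so on) is hypothetical, and the `activate_if_already_removed` flag exists solely to contrast the behavior before and after the fix.

```c
#include <stdbool.h>

/* Hypothetical, simplified vdev states; not the OpenZFS vdev_state_t. */
typedef enum {
	SKETCH_VDEV_HEALTHY,
	SKETCH_VDEV_REMOVED
} sketch_vdev_state_t;

/* Hypothetical vdev: its state and whether a spare already covers it. */
typedef struct {
	sketch_vdev_state_t state;
	bool spare_attached;
} sketch_vdev_t;

/* Hypothetical pool: only tracks how many hot spares are still free. */
typedef struct {
	int spares_available;
} sketch_pool_t;

/* Attach a free spare to the vdev, if one exists and none is attached. */
static bool
sketch_activate_spare(sketch_pool_t *pool, sketch_vdev_t *vd)
{
	if (vd->spare_attached || pool->spares_available == 0)
		return (false);
	pool->spares_available--;
	vd->spare_attached = true;
	return (true);
}

/*
 * Handle a "device removed" event. Returns true if a spare was activated.
 *
 * activate_if_already_removed == false models the problem described in
 * the PR: the kernel has already moved the vdev to REMOVED, so the
 * handler sees nothing left to do and no spare is activated.
 * activate_if_already_removed == true models the fix: spare activation
 * is attempted even though the state transition already happened.
 */
bool
sketch_on_removed_event(sketch_pool_t *pool, sketch_vdev_t *vd,
    bool activate_if_already_removed)
{
	if (vd->state == SKETCH_VDEV_REMOVED) {
		if (!activate_if_already_removed)
			return (false);
		return (sketch_activate_spare(pool, vd));
	}

	/* Hotplug-style path: the handler performs the transition itself. */
	vd->state = SKETCH_VDEV_REMOVED;
	return (sketch_activate_spare(pool, vd));
}
```

In this model, a vdev that is already REMOVED with one free spare in the pool gets no spare when the flag is false, and gets the spare when the flag is true. A second event for the same vdev is a no-op because a spare is already attached.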

How Has This Been Tested?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

In addition to hotplug events, the kernel may also mark a failing vdev
as REMOVED. This was observed in a customer report and reproduced by
forcing the NVMe host driver to disable the device after a failed reset
due to command timeout. In such cases, the spare was not activated
because the device had already transitioned to a REMOVED state before
zed processed the event.
To address this, explicitly attempt hot spare activation when the
kernel marks a device as REMOVED.

Signed-off-by: Ameer Hamza <[email protected]>
@ixhamza ixhamza requested a review from amotin March 28, 2025 18:12
@bugclerk bugclerk changed the title zed: Ensure spare activation after kernel-initiated device removal NAS-135033 / 25.04.1 / zed: Ensure spare activation after kernel-initiated device removal Mar 28, 2025
@amotin amotin merged commit 521e62c into stable/fangtooth Mar 28, 2025
15 of 20 checks passed
@amotin amotin deleted the NAS-135033-ft branch March 28, 2025 19:51
@bugclerk

This PR has been merged and conversations have been locked.
If you would like to discuss this issue further, please use our forums or raise a Jira ticket.

@truenas truenas locked as resolved and limited conversation to collaborators Mar 28, 2025