
NAS-134319 / 25.04-RC.1 / [stable/fangtooth] Backport ZFS Fixes from Upstream for RC.1 Freeze #273

Merged
merged 8 commits into stable/fangtooth from NAS-134319-ft
Feb 19, 2025

Conversation

@ixhamza ixhamza commented Feb 19, 2025

Motivation and Context

Since the ZFS 2.3.0 release, there have been a few important fixes and optimizations that need to be backported for the RC.1 freeze.

Description

How Has This Been Tested?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

bwatkinson and others added 8 commits February 19, 2025 15:18

Originally openzfs#16856 updated Linux Direct I/O requests to use the new
pin_user_pages API. However, it was an oversight that this PR only
handled iov_iter's of type ITER_IOVEC and ITER_UBUF. Other iov_iter
types may try to use the pin_user_pages API if it is available. This
can lead to panics, as the iov_iter is not iterated over correctly in
zfs_uio_pin_user_pages().

Unfortunately, the generic iov_iter APIs that call pin_user_pages_fast()
are GPL-only. Rather than update zfs_uio_pin_user_pages() to account
for all iov_iter types, we can simply call
zfs_uio_get_dio_page_iov_iter() if the iov_iter type is not ITER_IOVEC
or ITER_UBUF. zfs_uio_get_dio_page_iov_iter() uses iov_iter_get_pages()
calls, which can handle any iov_iter type.

In the future it might be worth using the iov_iter iterator functions
exposed in the header iov_iter.h since v6.7. These functions allow any
iov_iter type to be iterated over and advanced while applying a step
function during iteration. This could possibly be leveraged in
zfs_uio_pin_user_pages().

A new ZFS test case was added to verify that an ITER_BVEC is handled
correctly using this new code path. This test case was provided through
issue openzfs#16956.
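To illustrate the shape of that dispatch, here is a minimal standalone
sketch in plain C. The names (get_dio_pages, pin_user_pages_path,
get_pages_path) are hypothetical stand-ins, not the actual ZFS or
kernel symbols:

```c
#include <stdio.h>

/* Simplified stand-ins for the kernel's iov_iter types. */
enum iter_type { ITER_IOVEC, ITER_UBUF, ITER_BVEC, ITER_XARRAY };

static int
pin_user_pages_path(void)
{
	/* Fast path: only correct for IOVEC/UBUF iterators. */
	return (0);
}

static int
get_pages_path(void)
{
	/* Fallback path: can walk any iov_iter type. */
	return (0);
}

/* Route everything that is not IOVEC/UBUF to the generic path. */
static int
get_dio_pages(enum iter_type type)
{
	switch (type) {
	case ITER_IOVEC:
	case ITER_UBUF:
		return (pin_user_pages_path());
	default:
		return (get_pages_path());
	}
}

int
main(void)
{
	printf("%d\n", get_dio_pages(ITER_BVEC));
	return (0);
}
```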

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Ameer Hamza <[email protected]>
Signed-off-by: Brian Atkinson <[email protected]>
Closes openzfs#16956
Closes openzfs#17006

This change prevents prefetch from performing an unnecessary ARC buffer
fill when reading from disk.
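As a rough standalone illustration of the idea (all names hypothetical,
not the ARC code): a prefetch read only needs the data cached, so the
user-facing buffer fill can be deferred until a demand read arrives.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct cached_block {
	char raw[16];	/* data as read from disk */
	char *filled;	/* lazily created user-facing copy */
};

/* Only demand reads pay for the buffer fill; prefetch just caches. */
static const char *
read_block(struct cached_block *cb, bool prefetch)
{
	if (prefetch)
		return (NULL);
	if (cb->filled == NULL)
		cb->filled = strdup(cb->raw);
	return (cb->filled);
}

int
main(void)
{
	struct cached_block cb = { "hello", NULL };
	(void) read_block(&cb, true);		/* prefetch: no fill */
	printf("%s\n", read_block(&cb, false));	/* demand: fill now */
	free(cb.filled);
	return (0);
}
```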

Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Jaydeep Kshirsagar <[email protected]>
Co-authored-by: Alexander Motin <[email protected]>
Closes openzfs#17013

As zios are reexecuted after resume from suspension, their ready and
wait states need to be propagated to wait counts on all their parents.

It's possible for those parents to have active children passing through
READY or DONE, which then end up in zio_notify_parent(), take their
parent's lock, and decrement the wait count. Without also taking a lock
here, it's possible for an increment race to occur, which leads to
either there being no references left (tripping the assert in
zio_notify_parent()), or a parent waiting forever for a nonexistent
child to complete.

To protect against this, we simply take the appropriate zio locks in
zio_reexecute() before updating the wait counts.
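A minimal standalone model of the race and the fix, using pthreads with
simplified hypothetical names rather than the real zio structures: the
reexecute path takes the same per-parent lock the notify path already
holds, so the two read-modify-write sequences can never interleave.

```c
#include <pthread.h>

struct parent_io {
	pthread_mutex_t io_lock;
	int wait_count;
};

/* zio_notify_parent()-style: children decrement under the parent lock. */
void
notify_parent(struct parent_io *pio)
{
	pthread_mutex_lock(&pio->io_lock);
	pio->wait_count--;
	pthread_mutex_unlock(&pio->io_lock);
}

/*
 * The fix: the reexecute path increments under the same lock, so it can
 * never race a concurrent decrement and lose an update.
 */
void
reexecute_propagate(struct parent_io *pio)
{
	pthread_mutex_lock(&pio->io_lock);
	pio->wait_count++;
	pthread_mutex_unlock(&pio->io_lock);
}
```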

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Reviewed-by: Allan Jude <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#17016

When you are using large recordsizes in conjunction with raidz, with
incompressible data, you can pretty reliably be making 21 MB
allocations. Unfortunately, the fragmentation metric in ZFS considers
any metaslabs with 16 MB free chunks completely unfragmented, so you can
have a metaslab report 0% fragmented and be unable to satisfy an
allocation. When using the segment-based metaslab weight, this is
inconvenient; when using the space-based one, it can seriously degrade
performance.

We extend the fragmentation table up to 512MB and derive the table size
from the actual table rather than a static define. We also tweak the
one variable that depends on fragmentation directly.
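A standalone sketch of the table shape, with illustrative values rather
than the real ZFS table: deriving the size from the array itself means
extending the buckets to 512MB needs no matching macro update.

```c
#include <stdio.h>

/*
 * Each entry maps a power-of-two free-segment size to a fragmentation
 * percentage; larger free segments mean less fragmentation. Values are
 * illustrative only.
 */
static const int frag_table[] = {
	100,	/* 512B */
	95,	/* 1KB */
	90,	/* 2KB */
	/* ... intermediate buckets elided ... */
	10,	/* 128MB */
	5,	/* 256MB */
	0,	/* 512MB: only now is a metaslab "unfragmented" */
};

/* Size comes from the array, not a hand-maintained constant. */
#define	FRAG_TABLE_SIZE	(sizeof (frag_table) / sizeof (frag_table[0]))

int
main(void)
{
	printf("%zu buckets\n", FRAG_TABLE_SIZE);
	return (0);
}
```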

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Allan Jude <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes openzfs#16986

recv_fix_encryption_hierarchy() in its present state goes through all
stream filesystems, and for each one traverses the snapshots in order to
find one that exists locally. This happens by calling guid_to_name() for
each snapshot, which iterates through all children of the filesystem.
This results in 100% CPU utilization on one thread for several minutes
(for ~1000 filesystems on a Ryzen 4350G) at the end of a raw receive
(-w, regardless of whether it is encrypted or not, dry run or not).

Fix this by following a different logic: using the top_fs name, call
gather_nvlist() to gather the nvlists for all local filesystems. For
each filesystem, go through the snapshots to find the corresponding
stream filesystem (since we know the snapshot guids, we can search with
them in stream_avl for the stream's fs). Then go on to fix the
encryption roots and locations as before.

Avoiding guid_to_name() iteratively makes
recv_fix_encryption_hierarchy() significantly faster (from several
minutes to seconds for ~1000 filesystems on a Ryzen 4350G).

Another problem is the following: in case we have promoted a clone of
the filesystem outside the top filesystem specified in zfs send, zfs
receive does not fail but returns an error:
recv_incremental_replication() fails to find its origin and errors out
with needagain=1. This results in recv_fix_encryption_hierarchy() not
being called, which may leave some children of the top fs unmountable
since their encryption root was not updated. To circumvent this, make
recv_incremental_replication() silently ignore this error.
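A small standalone model of that inversion (hypothetical names; the
real code uses gather_nvlist() and stream_avl): index the stream's
filesystems by guid once, then resolve each snapshot with a logarithmic
lookup instead of rescanning all children per snapshot.

```c
#include <stdio.h>
#include <stdlib.h>

struct stream_fs {
	unsigned long long guid;
	const char *name;
};

static int
cmp_guid(const void *a, const void *b)
{
	const struct stream_fs *x = a, *y = b;
	return ((x->guid > y->guid) - (x->guid < y->guid));
}

int
main(void)
{
	/* stream_avl-style index, sorted by guid once. */
	struct stream_fs stream[] = {
		{ 42, "tank/b" }, { 11, "tank/a" }, { 77, "tank/c" },
	};
	size_t n = sizeof (stream) / sizeof (stream[0]);
	qsort(stream, n, sizeof (stream[0]), cmp_guid);

	/* One pass over local snapshot guids; O(log n) per lookup. */
	unsigned long long local[] = { 77, 42 };
	for (size_t i = 0; i < 2; i++) {
		struct stream_fs key = { local[i], NULL };
		struct stream_fs *hit = bsearch(&key, stream, n,
		    sizeof (stream[0]), cmp_guid);
		if (hit != NULL)
			printf("guid %llu -> %s\n", hit->guid, hit->name);
	}
	return (0);
}
```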

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: George Amanakis <[email protected]>
Closes openzfs#16929

For zfs_rename, after the dataset name is successfully updated, the
dataset handle that was passed to zfs_rename still contains the old
name, which leaves the handle invalid. Subsequent operations performed
with this handle fail, since a dataset with the old name can no longer
be found.

changelist_rename does update the names in dataset handles,
but those are temporary handles that were created during
changelist_gather. The original handle that was used to call
zfs_rename is not updated.

We should update the name in the original ZFS handle after the rename
IOCTL returns success.
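A minimal standalone model of the fix, with simplified hypothetical
types rather than the libzfs ones: once the rename ioctl reports
success, refresh the name stored in the caller's handle in place.

```c
#include <stdio.h>
#include <string.h>

struct zfs_handle_model {
	char name[256];
};

/* Stand-in for the rename ioctl; always succeeds here. */
static int
rename_ioctl(const char *oldname, const char *newname)
{
	(void) oldname;
	(void) newname;
	return (0);
}

static int
rename_dataset(struct zfs_handle_model *zhp, const char *target)
{
	int err = rename_ioctl(zhp->name, target);
	if (err == 0) {
		/* The fix: keep the caller's handle valid. */
		(void) snprintf(zhp->name, sizeof (zhp->name), "%s", target);
	}
	return (err);
}

int
main(void)
{
	struct zfs_handle_model h = { "tank/old" };
	(void) rename_dataset(&h, "tank/new");
	printf("%s\n", h.name);
	return (0);
}
```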

Signed-off-by: Umer Saleem <[email protected]>
Reviewed-by: Ameer Hamza <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>

skc->skc_name also needs to be freed in an error path.
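As a generic standalone illustration of this leak pattern (hypothetical
names, not the SPL code): every allocation made before a failure point,
including the duplicated name, must be released on the error path.

```c
#include <stdlib.h>
#include <string.h>

struct cache {
	char *name;
	void *slab;
};

struct cache *
cache_create(const char *name)
{
	struct cache *c = calloc(1, sizeof (*c));
	if (c == NULL)
		return (NULL);
	c->name = strdup(name);
	if (c->name == NULL) {
		free(c);
		return (NULL);
	}
	c->slab = malloc(4096);
	if (c->slab == NULL) {
		free(c->name);	/* the fix: the name must be freed too */
		free(c);
		return (NULL);
	}
	return (c);
}
```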

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Signed-off-by: Vandana Rungta <[email protected]>
Closes openzfs#17041

Since we are calculating free-space fragmentation, we should weight
metaslabs by the amount of their free space, not their full size.
Fragmentation of full metaslabs may not matter in the presence of empty
ones. The old algorithm did not differentiate metaslabs having only one
free 4KB block from metaslabs having 50% of their space free in 4KB
blocks, reporting equally high fragmentation for both.

While there, move metaslab_group_alloc_update() call after setting
mg_fragmentation, otherwise the effect may be delayed by one TXG.
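A standalone sketch of the weighting (simplified, not the metaslab
code): group fragmentation becomes a free-space-weighted mean, so a
nearly full metaslab contributes almost nothing regardless of how
fragmented its last few blocks are.

```c
#include <stdint.h>
#include <stdio.h>

struct ms {
	uint64_t frag_pct;	/* per-metaslab fragmentation, 0..100 */
	uint64_t free;		/* free bytes: the new weight */
};

static uint64_t
group_fragmentation(const struct ms *m, int count)
{
	uint64_t weighted = 0, free_total = 0;
	for (int i = 0; i < count; i++) {
		weighted += m[i].frag_pct * m[i].free;
		free_total += m[i].free;
	}
	return (free_total == 0 ? 0 : weighted / free_total);
}

int
main(void)
{
	/*
	 * A full metaslab with one fragmented 4KB block barely moves
	 * the result next to a mostly free, unfragmented one.
	 */
	struct ms g[] = { { 100, 4096 }, { 0, 1ULL << 33 } };
	printf("%llu%%\n",
	    (unsigned long long)group_fragmentation(g, 2));
	return (0);
}
```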

Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Tony Nguyen <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>

@ixhamza ixhamza requested review from amotin and usaleem-ix February 19, 2025 17:20
@bugclerk bugclerk changed the title [stable/fangtooth] Backport ZFS Fixes from Upstream for RC.1 Freeze NAS-134319 / 25.04-RC.1 / [stable/fangtooth] Backport ZFS Fixes from Upstream for RC.1 Freeze Feb 19, 2025
@amotin amotin merged commit 2c37149 into stable/fangtooth Feb 19, 2025
42 of 72 checks passed
@bugclerk

This PR has been merged and conversations have been locked.
If you would like to discuss this issue further, please use our forums or raise a Jira ticket.

@truenas truenas locked as resolved and limited conversation to collaborators Feb 19, 2025
@amotin amotin deleted the NAS-134319-ft branch February 19, 2025 20:27