Introduce zfs rewrite subcommand #17246


Draft · wants to merge 1 commit into master

Conversation

@amotin (Member) commented Apr 15, 2025

Motivation and Context

For years users have been asking for the ability to re-balance a pool after vdev addition, defragment randomly written files, change some properties of already written files, etc. The closest options are to either copy and rename a file or send/receive and rename the dataset. Unfortunately, all of those options have downsides.

Description

This change introduces a new zfs rewrite subcommand that rewrites the content of specified file(s) as-is, without modification, but at a different location and with the currently set compression, checksum, dedup, copies and other property values. It is faster than a read plus write, since it does not require copying data to user-space. It is also faster for sync=always datasets, since without data modification it does not require ZIL writing. Also, since it is protected by normal range locks, it can be done under any other load. And it does not affect the file's modification time or other properties.
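For illustration, a minimal sketch of the intended workflow, assuming a hypothetical dataset tank/data (the -r recursive flag matches the usage shown later in this thread):

# Change a property; data already on disk keeps its old encoding
$ zfs set compression=zstd tank/data

# Rewrite the existing files in place so the new setting applies
$ zfs rewrite -r /tank/data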

How Has This Been Tested?

Manually tested it on FreeBSD. Linux-specific code is not yet tested.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

github-actions bot added the "Status: Work in Progress" label Apr 15, 2025
@amotin (Member, Author) commented Apr 15, 2025

I've tried to find some existing kernel API to wire this to, but found that plenty of Linux file systems each implement their own ioctls for similar purposes. I did the same, except that I chose the ioctl number almost arbitrarily, since ZFS seems quite rough in this area. I am open to better ideas before this is committed.

@HPPinata

This looks amazing! Not having to sift through half a dozen shell scripts every time this comes up to see what currently handles the most edge cases correctly is very much appreciated. Especially with RaidZ expansion, being able to direct users to run a built-in command instead of debating what script to send them to would be very nice.

Also, being able to reliably rewrite a live dataset while it's in use, without having to worry about skipped files or mtime conflicts, would make the whole process much less of a hassle. With snapshots/space usage being the only things to really worry about, this seems as close to perfect as reasonably possible (without diving deep into the internals and messing with snapshot immutability). Bravo!

This allows rewriting the content of specified file(s) as-is without
modifications, but at a different location, compression, checksum,
dedup, copies and other parameter values.  It is faster than read
plus write, since it does not require copying data to user-space.
It is also faster for sync=always datasets, since without data
modification it does not require ZIL writing.  Also, since it is
protected by normal range locks, it can be done under any other
load.  It also does not affect the file's modification time or
other properties.

Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
@amotin added the "Status: Design Review Needed" label Apr 16, 2025
@clhedrick

Thank you. This fixes one of the biggest problems with ZFS.

Is there a way to suspend the process? It might be nice to have it run only during off hours.

@amotin (Member, Author) commented Apr 16, 2025

Is there a way to suspend the process? It might be nice to have it run only during off hours.

It does one file at a time and should be killable in between. Signal handling within one huge file can probably be added, though restarting the process is up to the user. I didn't plan to go that deep into this area within this PR.
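Until finer-grained pause/resume exists, plain shell job control may already be enough, since the rewrite is driven by an ordinary process. A sketch (the dataset path is hypothetical, and per the above, a stop may only take effect between files):

# Start the rewrite in the background
$ zfs rewrite -r /tank/data &

# Suspend it for business hours, resume it later
$ kill -STOP %1
$ kill -CONT %1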

@clhedrick

I couldn't find documentation in the files changed, so I have to guess how it actually works. Is it a file at a time? I guess you could feed it with a "find" command. For a system with a billion files, do you have a sense of how long this is going to take? We can do scrubs in a day or two, but rsync is impractically slow. If this is happening at the file-system level, that might be the case here as well.

@stuartthebruce

I guess you could feed it with a "find" command.

This will likely be a good use case for GNU Parallel.
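For example, a sketch along those lines, assuming zfs rewrite accepts individual file arguments as the description suggests (the path and job count are illustrative):

# Rewrite files under a tree, eight at a time
$ find /tank/data -type f -print0 | parallel -0 -j 8 zfs rewrite {}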

@HPPinata

I couldn't find documentation in the files changed, so I have to guess how it actually works. Is it a file at a time? I guess you could feed it with a "find" command. For a system with a billion files, do you have a sense of how long this is going to take? We can do scrubs in a day or two, but rsync is impractically slow. If this is happening at the file-system level, that might be the case here as well.

It can take a directory as an argument, and there are some recursive functions and iterators in the code, so piping find into it should not be necessary. That avoids some userspace file-handling overhead, but it still has to go through the contents of each directory one file at a time. I also don't see any parallel execution or threading (though I'm not too familiar with ZFS internals; maybe some of the primitives used here run asynchronously?).

Whether or not it is parallelized in userspace by calling it on many files/directories at once, it has the required locking to just run in the background, and it is significantly more elegant than the cp + mtime (or potentially userspace hash) check to make sure files didn't change during the copy, which avoids one of the pitfalls of existing solutions.

@amotin (Member, Author) commented Apr 16, 2025

I haven't benchmarked it deeply yet, but unless the files are tiny, I don't expect a major need for parallelism. The kernel code handles up to 16MB at a time and lets ZFS do read-ahead and write-back on top of that, so there will be quite a lot in the pipeline to saturate the disks and/or the system, especially if there is some compression/checksumming/encryption. And without the need to copy data to/from user-space, the single thread will not have much to do, I think mostly decompression from ARC. A bunch of small files on a wide HDD pool may indeed suffer from read latency, I suspect, but that we can optimize/parallelize in user-space all day long.

@tonyhutter (Contributor) commented Apr 16, 2025

I gave this a quick test. It's very fast and does exactly what it says 👍

# Copy ZFS source workspace to pool with compression=off
$ time cp -a ~/zfs /tank2

real	0m0.600s
user	0m0.032s
sys	0m0.519s

$ df -h /tank2
Filesystem      Size  Used Avail Use% Mounted on
tank2           9.3G  893M  8.4G  10% /tank2


# Set compression to 'gzip' and rewrite
$ sudo ./zfs set compression=gzip tank2
$ time sudo ./zfs rewrite -r /tank2

real	0m2.272s
user	0m0.005s
sys	0m0.005s

$ df -h /tank2
Filesystem      Size  Used Avail Use% Mounted on
tank2           9.3G  402M  8.9G   5% /tank2


# Set compression to 'lz4' and rewrite
$ sudo ./zfs set compression=lz4 tank2
$ time sudo ./zfs rewrite -r /tank2
real	0m1.947s
user	0m0.002s
sys	0m0.010s

$ df -h /tank2
Filesystem      Size  Used Avail Use% Mounted on
tank2           9.3G  456M  8.8G   5% /tank2


# Set compression to 'zstd' and rewrite
$ sudo ./zfs set compression=zstd tank2
$ time sudo ./zfs rewrite -r /tank2

real	0m0.616s
user	0m0.003s
sys	0m0.006s

$ df -h /tank2
Filesystem      Size  Used Avail Use% Mounted on
tank2           9.3G  366M  8.9G   4% /tank2

I can already see people writing scripts that go through every dataset, setting the optimal compression, recordsize, etc., and zfs rewrite-ing them.
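Presumably something like this sketch, which walks every mounted filesystem in a pool, updates a property, and rewrites the existing data (the pool name and property choice are illustrative, and running per-dataset recursively is my assumption):

#!/bin/sh
# For each mounted filesystem in the pool, set the desired property,
# then rewrite existing data so the new setting applies to it.
# Run as root; zfs list -H emits tab-separated fields.
zfs list -H -o name,mountpoint -t filesystem -r tank |
while IFS="$(printf '\t')" read -r ds mp; do
    # Skip datasets without a usable mountpoint
    case "$mp" in none|legacy|-) continue ;; esac
    zfs set compression=zstd "$ds"
    zfs rewrite -r "$mp"
done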

@amotin (Member, Author) commented Apr 16, 2025

Cool! Though recordsize is one of the things it can't change, since that would require a real byte-level copy, not just marking existing blocks dirty. I am not sure that could be done under load in general; at least it would be much more complicated.
