btrfs-progs: balance: add extra delay if converting with a missing de… #946

adam900710 · 2025-02-03T04:17:36Z

…vice

[BUG]
There is a reproducer that can trigger btrfs to flips RO:

mkfs.btrfs -f -mraid1 -draid1 /dev/sdd /dev/sde

mount /dev/sdd /mnt/btrfs

echo 1 > /sys/block/sde/device/delete

btrfs balance start -mconvert=dup -dconvert=single /mnt/btrfs

ERROR: error during balancing '.': Input/output error
There may be more info in syslog - try dmesg | tail

Then btrfs will flip read-only with the following errors:

btrfs: attempt to access beyond end of device
sde: rw=6145, sector=21696, nr_sectors = 32 limit=0
btrfs: attempt to access beyond end of device
sde: rw=6145, sector=21728, nr_sectors = 32 limit=0
btrfs: attempt to access beyond end of device
sde: rw=6145, sector=21760, nr_sectors = 32 limit=0
BTRFS error (device sdd): bdev /dev/sde errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
BTRFS error (device sdd): bdev /dev/sde errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
BTRFS error (device sdd): bdev /dev/sde errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
BTRFS error (device sdd): bdev /dev/sde errs: wr 3, rd 0, flush 1, corrupt 0, gen 0
btrfs: attempt to access beyond end of device
sde: rw=145409, sector=128, nr_sectors = 8 limit=0
BTRFS warning (device sdd): lost super block write due to IO error on /dev/sde (-5)
BTRFS error (device sdd): bdev /dev/sde errs: wr 4, rd 0, flush 1, corrupt 0, gen 0
btrfs: attempt to access beyond end of device
sde: rw=14337, sector=131072, nr_sectors = 8 limit=0
BTRFS warning (device sdd): lost super block write due to IO error on /dev/sde (-5)
BTRFS error (device sdd): bdev /dev/sde errs: wr 5, rd 0, flush 1, corrupt 0, gen 0
BTRFS error (device sdd): error writing primary super block to device 2
BTRFS info (device sdd): balance: start -dconvert=single -mconvert=dup -sconvert=dup
BTRFS info (device sdd): relocating block group 1372585984 flags data|raid1
BTRFS error (device sdd): bdev /dev/sde errs: wr 5, rd 0, flush 2, corrupt 0, gen 0
BTRFS warning (device sdd): chunk 2446327808 missing 1 devices, max tolerance is 0 for writable mount
BTRFS: error (device sdd) in write_all_supers:4044: errno=-5 IO failure (errors while submitting device barriers.)
BTRFS info (device sdd state E): forced readonly
BTRFS warning (device sdd state E): Skipping commit of aborted transaction.
BTRFS error (device sdd state EA): Transaction aborted (error -5)
BTRFS: error (device sdd state EA) in cleanup_transaction:2017: errno=-5 IO failure
BTRFS info (device sdd state EA): balance: ended with status: -5

[CAUSE]
The root cause is that, deleting devices using sysfs interface normally will trigger the shutdown callback for the fs.

But btrfs doesn't handle that callback at all, thus it can not really know that device is no longer avaialble, thus btrfs will still try to do usual read/write on that device.

This is fine if the user do nothing, as RAID1 can handle it properly.

But if we try to convert to SINGLE/DUP, btrfs will still use that device to allocate new data/metadata chunks.
And if a new metadata chunk is allocated to the removed device, all the write will be lost, and trigger the super block write/barrier errors above.

[USER SPACE ENHANCEMENT]
For now, add extra missing devices check at btrfs-balance command. If there is a missing devices, btrfs balance will add a 10 seconds delay and warn the possible dangerous.

The root fix is to introduce a failing/removed device detection for btrfs, but that will be a pretty big feature and will take quite some time before landing it upstream.

Reported-by: Jeff Siddall [email protected]
Link: https://lore.kernel.org/linux-btrfs/[email protected]/

Move the option documentation next to --rootdir, reword documentation. Signed-off-by: David Sterba <[email protected]>

The compression support is optional, eg. also in 'btrfs-restore', so print the support in help text. usage: mkfs.btrfs [options] <dev> [<dev...>] ... --compress ALGO[:LEVEL] compress files by algorithm and level, ALGO can be 'no' (default), zstd, lzo, zlib Built-in: - ZSTD: yes - LZO: yes - ZLIB: yes ... Signed-off-by: David Sterba <[email protected]>

Put option parsing to its own helper to keep the switch/case smaller. Signed-off-by: David Sterba <[email protected]>

The level is a small number, no need to pass it around as u64. Signed-off-by: David Sterba <[email protected]>

The constants will be used in main() to validate command line options. Signed-off-by: David Sterba <[email protected]>

Report invalid compression specification while parsing the options. Now an ivalid level won't be silently accepted and capped when processing the files. Other checks regarding conditional support of LZO and ZSTD are left in place. Signed-off-by: David Sterba <[email protected]>

Signed-off-by: David Sterba <[email protected]>

It does not make sense to pass only the compression option when there are no files being added by --rootdir. Signed-off-by: David Sterba <[email protected]>

$ mkfs.btrfs --compress zlib:9 --rootdir Documentation img ... Rootdir from: /path/Documentation Compress: zlib:9 Shrink: no ... Signed-off-by: David Sterba <[email protected]>

There are two configure options for compression so group them and update the description so it mentions all commands that use compression. Signed-off-by: David Sterba <[email protected]>

For completeness and parity with other tools add the option --version and mention --help. Signed-off-by: David Sterba <[email protected]>

Enhance information in the help text where some interesting information was not missing and would require looking up the documentation. Signed-off-by: David Sterba <[email protected]>

The new option for experimental featues was added in 6.12, mention it where the DEBUG option was previously used for the same purpose. [ci skip] Signed-off-by: David Sterba <[email protected]>

[BUG] There is a bug report that read-only scrub on a read-write fs still causes writes into the fs, and that will be caught if there is a read-only block device among the storage stack. This will cause a kernel warning on failed transaction commit: BTRFS info (device dm-3): first mount of filesystem e18f0c40-88de-413f-9d7e-dcc8136ad6dd BTRFS info (device dm-3): using crc32c (crc32c-intel) checksum algorithm BTRFS info (device dm-3): using free-space-tree BTRFS info (device dm-3): scrub: started on devid 1 Trying to write to read-only block-device md127 btrfs_dev_stat_inc_and_print: 362 callbacks suppressed BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 1, rd 0, flush 0, corrupt 0, gen 0 BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 2, rd 0, flush 0, corrupt 0, gen 0 BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 3, rd 0, flush 0, corrupt 0, gen 0 BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 4, rd 0, flush 0, corrupt 0, gen 0 BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 5, rd 0, flush 0, corrupt 0, gen 0 BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 6, rd 0, flush 0, corrupt 0, gen 0 BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 7, rd 0, flush 0, corrupt 0, gen 0 BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 8, rd 0, flush 0, corrupt 0, gen 0 BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 9, rd 0, flush 0, corrupt 0, gen 0 BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 10, rd 0, flush 0, corrupt 0, gen 0 BTRFS: error (device dm-3) in btrfs_commit_transaction:2523: errno=-5 IO failure (Error while writing out transaction) BTRFS info (device dm-3 state E): forced readonly BTRFS warning (device dm-3 state E): Skipping commit of aborted transaction. BTRFS error (device dm-3 state EA): Transaction aborted (error -5) BTRFS: error (device dm-3 state EA) in cleanup_transaction:2017: errno=-5 IO failure BTRFS warning (device dm-3 state EA): failed setting block group ro: -5 BTRFS info (device dm-3 state EA): scrub: not finished on devid 1 with status: -5 [CAUSE] The root cause is inside btrfs_inc_block_group_ro(), where we need to hold a transaction handle, to prevent the transaction to be committed, until we hold ro_block_group_mutex. This will cause an empty transaction by itself, thus even if we can mark the block group read-only without any extra workload, we still need to commit the new and empty transaction. Unfortunately this means RO scrub on RW filesystem will always cause the fs to be updated. [FIX] The best fix is to make btrfs to avoid empty commit transaction, but even with that done, read-only scrub on rw mount can still cause real metadata updates (e.g. allocate new chunks and update device error statistics). It will be very complex to make read-only scrub to be fully read-only on a read-write btrfs. Thankfully read-only scrub on read-write mount with read-only device in the storage stack is pretty rare, thus a documentation update should be enough. Issue: kdave#934 Pull-request: kdave#935 Signed-off-by: Qu Wenruo <[email protected]> Signed-off-by: David Sterba <[email protected]>

If qgroups are not enabled the 'clear-stale' command unconditionally calls sync on the whole filesystem, which is unnecessary and can slow down the system. Do a check first if qgroups are enabled. Issue: kdave#942 Signed-off-by: David Sterba <[email protected]>

There's a report that handling of --help in connection with other option is confusing in some cases: Various sub-commands to btrfs claim --help is an unrecognized option if there are any other options on the CLI. Examples of --help being unrecognized. $ btrfs filesystem defragment -v --help btrfs filesystem defragment: unrecognized option '--help' Try 'btrfs filesystem defragment --help' for more information $ btrfs balance start -v --help $ btrfs balabtrfs balance start -v --help btrfs balance start: unrecognized option '--help' Try 'btrfs balance start --help' for more information Alternatively, some sub-commands support --help even if there are extra options on the CLI: $ btrfs filesystem df -v --help usage: btrfs filesystem df [options] <path> Show space usage information for a mount point -b|--raw raw numbers in bytes -h|--human-readable human friendly numbers, base 1024 (default) -H human friendly numbers, base 1000 --iec use 1024 as a base (KiB, MiB, GiB, TiB) --si use 1000 as a base (kB, MB, GB, TB) -k|--kbytes show sizes in KiB, or kB with --si -m|--mbytes show sizes in MiB, or MB with --si -g|--gbytes show sizes in GiB, or GB with --si -t|--tbytes show sizes in TiB, or TB with --si Global options: --format TYPE where TYPE is: text Update clean_args_no_options() to detect unrecognized options properly and do not show full help text (which usued to be done in the past but is confusing and not pointing to the problem). There's also a mixup of global verbosity options and command-specific verbosity options (now deprecated), so in some commands they are recognized. Also handle the --help option for commands without options so the following works: $ btrfs fi df -v --help / btrfs filesystem df: invalid option 'v' Try 'btrfs filesystem df --help' for more information $ btrfs fi df -H --help / usage: btrfs filesystem df [options] <path> Show space usage information for a mount point -b|--raw raw numbers in byte ... Issue: kdave#889 Signed-off-by: David Sterba <[email protected]>

Signed-off-by: David Sterba <[email protected]>

By mistake in the original commit 992fd23 ("btrfs-progs: add nodiscard option to device add") the --nodiscard could take an optional parameter but this was not intended. This is also not documented so there's little chance somebody would actually use it like that. Signed-off-by: David Sterba <[email protected]>

…vice [BUG] There is a reproducer that can trigger btrfs to flips RO: # mkfs.btrfs -f -mraid1 -draid1 /dev/sdd /dev/sde # mount /dev/sdd /mnt/btrfs # echo 1 > /sys/block/sde/device/delete # btrfs balance start -mconvert=dup -dconvert=single /mnt/btrfs ERROR: error during balancing '.': Input/output error There may be more info in syslog - try dmesg | tail Then btrfs will flip read-only with the following errors: btrfs: attempt to access beyond end of device sde: rw=6145, sector=21696, nr_sectors = 32 limit=0 btrfs: attempt to access beyond end of device sde: rw=6145, sector=21728, nr_sectors = 32 limit=0 btrfs: attempt to access beyond end of device sde: rw=6145, sector=21760, nr_sectors = 32 limit=0 BTRFS error (device sdd): bdev /dev/sde errs: wr 1, rd 0, flush 0, corrupt 0, gen 0 BTRFS error (device sdd): bdev /dev/sde errs: wr 2, rd 0, flush 0, corrupt 0, gen 0 BTRFS error (device sdd): bdev /dev/sde errs: wr 3, rd 0, flush 0, corrupt 0, gen 0 BTRFS error (device sdd): bdev /dev/sde errs: wr 3, rd 0, flush 1, corrupt 0, gen 0 btrfs: attempt to access beyond end of device sde: rw=145409, sector=128, nr_sectors = 8 limit=0 BTRFS warning (device sdd): lost super block write due to IO error on /dev/sde (-5) BTRFS error (device sdd): bdev /dev/sde errs: wr 4, rd 0, flush 1, corrupt 0, gen 0 btrfs: attempt to access beyond end of device sde: rw=14337, sector=131072, nr_sectors = 8 limit=0 BTRFS warning (device sdd): lost super block write due to IO error on /dev/sde (-5) BTRFS error (device sdd): bdev /dev/sde errs: wr 5, rd 0, flush 1, corrupt 0, gen 0 BTRFS error (device sdd): error writing primary super block to device 2 BTRFS info (device sdd): balance: start -dconvert=single -mconvert=dup -sconvert=dup BTRFS info (device sdd): relocating block group 1372585984 flags data|raid1 BTRFS error (device sdd): bdev /dev/sde errs: wr 5, rd 0, flush 2, corrupt 0, gen 0 BTRFS warning (device sdd): chunk 2446327808 missing 1 devices, max tolerance is 0 for writable mount BTRFS: error (device sdd) in write_all_supers:4044: errno=-5 IO failure (errors while submitting device barriers.) BTRFS info (device sdd state E): forced readonly BTRFS warning (device sdd state E): Skipping commit of aborted transaction. BTRFS error (device sdd state EA): Transaction aborted (error -5) BTRFS: error (device sdd state EA) in cleanup_transaction:2017: errno=-5 IO failure BTRFS info (device sdd state EA): balance: ended with status: -5 [CAUSE] The root cause is that, deleting devices using sysfs interface normally will trigger the shutdown callback for the fs. But btrfs doesn't handle that callback at all, thus it can not really know that device is no longer avaialble, thus btrfs will still try to do usual read/write on that device. This is fine if the user do nothing, as RAID1 can handle it properly. But if we try to convert to SINGLE/DUP, btrfs will still use that device to allocate new data/metadata chunks. And if a new metadata chunk is allocated to the removed device, all the write will be lost, and trigger the super block write/barrier errors above. [USER SPACE ENHANCEMENT] For now, add extra missing devices check at btrfs-balance command. If there is a missing devices, `btrfs balance` will add a 10 seconds delay and warn the possible dangerous. The root fix is to introduce a failing/removed device detection for btrfs, but that will be a pretty big feature and will take quite some time before landing it upstream. Reported-by: Jeff Siddall <[email protected]> Link: https://lore.kernel.org/linux-btrfs/[email protected]/ Signed-off-by: Qu Wenruo <[email protected]>

kdave · 2025-02-05T23:12:40Z

cmds/balance.c

+		 * Kernel will report the device path even if we can no
+		 * longer access it anymore. So we have to manually check it.
+		 */
+		fd = open((char *)cur_dev_info->path, O_RDONLY);


Do we really need to open() it? Would stat() be sufficient? I think the device node in /dev/ also disappears so we don't need to distinguish path existence and actual block device behind. But there may be some exception.

kdave · 2025-02-05T23:14:12Z

cmds/balance.c

+		struct btrfs_ioctl_dev_info_args *cur_dev_info;
+		int fd;
+
+		cur_dev_info = (struct btrfs_ioctl_dev_info_args *)&dev_info_args[i];


This does not seem to be necessary, there's only one use of cur_dev_info, so you can drop the temporary variable.

kdave · 2025-02-05T23:15:23Z

cmds/balance.c

+		if (ret == 0)
+			continue;
+		convert_warned = true;
+		printf("WARNING:\n\n");


warning("...")

kdave · 2025-02-05T23:15:53Z

cmds/balance.c

+		printf("\tConversion with missing device(s) can be dangerous.\n");
+		printf("\tPlease use `btrfs replace` or `btrfs device remove` instead.\n");
+		if (force) {
+			printf("\tSafety timeout skipped due to --force\n\n");


pr_verbose(LOG_DEFAULT, "...")

kdave and others added 23 commits January 22, 2025 20:34

btrfs-profs: mkfs: update compression help and documentation

2bf30ba

Move the option documentation next to --rootdir, reword documentation. Signed-off-by: David Sterba <[email protected]>

btrfs-progs: mkfs: factor out compression option parsing

83d1c5c

Put option parsing to its own helper to keep the switch/case smaller. Signed-off-by: David Sterba <[email protected]>

btrfs-progs: mkfs: change compression level type to 32bit type

ecc28c5

The level is a small number, no need to pass it around as u64. Signed-off-by: David Sterba <[email protected]>

btrfs-progs: mkfs: move compression level definitions to rootdir.h

a2cac37

The constants will be used in main() to validate command line options. Signed-off-by: David Sterba <[email protected]>

btrfs-progs: mkfs: document level ranges for zlib and zstd in help

b81c839

Signed-off-by: David Sterba <[email protected]>

btrfs-progs: mkfs: add --compress and --rootdir constraints

4a5e3fa

It does not make sense to pass only the compression option when there are no files being added by --rootdir. Signed-off-by: David Sterba <[email protected]>

btrfs-progs: mkfs: print in summary if compression is used

dfee618

$ mkfs.btrfs --compress zlib:9 --rootdir Documentation img ... Rootdir from: /path/Documentation Compress: zlib:9 Shrink: no ... Signed-off-by: David Sterba <[email protected]>

btrfs-progs: build: move --disable-lzo option closer to zstd

57ca924

There are two configure options for compression so group them and update the description so it mentions all commands that use compression. Signed-off-by: David Sterba <[email protected]>

btrfs-progs: image: add option --version, update help text

18d71bb

For completeness and parity with other tools add the option --version and mention --help. Signed-off-by: David Sterba <[email protected]>

btrfs-progs: btrfstune: add option --version, update help text

5e5764a

For completeness and parity with other tools add the option --version and mention --help. Signed-off-by: David Sterba <[email protected]>

btrfs-progs: convert: add option --version, update help text

6db9705

For completeness and parity with other tools add the option --version and mention --help. Signed-off-by: David Sterba <[email protected]>

btrfs-progs: mkfs: update help text

f4a72e3

Enhance information in the help text where some interesting information was not missing and would require looking up the documentation. Signed-off-by: David Sterba <[email protected]>

btrfs-progs: docs: mention CONFIG_BTRFS_EXPERIMENTAL if relevant

8c809fd

The new option for experimental featues was added in 6.12, mention it where the DEBUG option was previously used for the same purpose. [ci skip] Signed-off-by: David Sterba <[email protected]>

btrfs-progs: tests: add cases to combine --help with other options

100f8e5

Signed-off-by: David Sterba <[email protected]>

btrfs-progs: docs: move command line guidelines do dev

6eb0e9b

Signed-off-by: David Sterba <[email protected]>

btrfs-progs: docs: update command line and UI conventions

44bd0f6

Signed-off-by: David Sterba <[email protected]>

kdave requested changes Feb 5, 2025

View reviewed changes

kdave force-pushed the devel branch 2 times, most recently from 24ab173 to 800d7b0 Compare February 11, 2025 19:43

kdave added this to the v6.13.1 milestone Feb 12, 2025

kdave force-pushed the devel branch 3 times, most recently from 1480f4a to 103338d Compare February 17, 2025 17:30

kdave force-pushed the devel branch from 103338d to 8c4549c Compare February 17, 2025 20:16

kdave force-pushed the devel branch 2 times, most recently from 639bee9 to 6be1d80 Compare March 11, 2025 21:15

kdave modified the milestones: v6.13.1, v6.14 Mar 18, 2025

kdave force-pushed the devel branch 2 times, most recently from c9ae825 to 5ad147c Compare March 26, 2025 19:41

kdave modified the milestones: v6.14, v6.14.1 Mar 26, 2025

kdave force-pushed the devel branch 6 times, most recently from ef43ce6 to 3eff852 Compare May 30, 2025 14:14

kdave modified the milestones: v6.14.1, v6.15, v6.15.1 May 30, 2025

adam900710 force-pushed the devel branch from 7b69bea to 96d5e91 Compare June 5, 2025 22:54

kdave force-pushed the devel branch 4 times, most recently from dafafca to 5d47f58 Compare June 20, 2025 19:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

btrfs-progs: balance: add extra delay if converting with a missing de… #946

btrfs-progs: balance: add extra delay if converting with a missing de… #946

Uh oh!

adam900710 commented Feb 3, 2025

Uh oh!

kdave Feb 5, 2025

Uh oh!

kdave Feb 5, 2025

Uh oh!

kdave Feb 5, 2025

Uh oh!

kdave Feb 5, 2025

Uh oh!

Uh oh!

btrfs-progs: balance: add extra delay if converting with a missing de… #946

Are you sure you want to change the base?

btrfs-progs: balance: add extra delay if converting with a missing de… #946

Uh oh!

Conversation

adam900710 commented Feb 3, 2025

mkfs.btrfs -f -mraid1 -draid1 /dev/sdd /dev/sde

mount /dev/sdd /mnt/btrfs

echo 1 > /sys/block/sde/device/delete

btrfs balance start -mconvert=dup -dconvert=single /mnt/btrfs

Uh oh!

kdave Feb 5, 2025

Choose a reason for hiding this comment

Uh oh!

kdave Feb 5, 2025

Choose a reason for hiding this comment

Uh oh!

kdave Feb 5, 2025

Choose a reason for hiding this comment

Uh oh!

kdave Feb 5, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!