Support 4k storage #4974

Closed
ij1 opened this issue Apr 13, 2019 · 58 comments · Fixed by QubesOS/qubes-linux-utils#119
Labels
C: installer C: storage hardware support P: default Priority: default. Default priority for new issues, to be replaced given sufficient information.

Comments

@ij1

ij1 commented Apr 13, 2019

Qubes OS version

R4.0

Affected component(s) or functionality

VMs not working/starting right from a fresh install.

Brief summary

Right after a fresh install, all VMs fail to mount root and therefore fail to start beyond the point where they expect /dev/xvda3 to be available. This happens on a device that has 4kB logical and physical block sizes (NVMe drive). This was not a problem in R3.2 (as it used files by default for VM storage).

To Reproduce

Steps to reproduce the behavior:

  1. Install Qubes to a drive with 4kB sector size (both logical / physical); (I put /boot to a SATA drive with 512B sectors to avoid BIOS/NVMe boot challenges, rest of the system is on the NVMe with 4kB sectors).
  2. Firstboot stuff fails
  3. After clicking "finish" for firstboot, find out that no VM will start successfully (which explains firstboot failures I guess)
  4. Look to the VM logs, and find this from there:
[    0.887548] blkfront: xvda: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
[    0.902355] blkfront: xvdb: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
[    0.924386] blkfront: xvdc: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
[    0.940325] blkfront: xvdd: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
Waiting for /dev/xvda* devices...
Qubes: Doing R/W setup for TemplateVM...
[    1.049451] random: sfdisk: uninitialized urandom read (4 bytes read)
[    1.052481]  xvdc: xvdc1
[    1.060250] random: mkswap: uninitialized urandom read (16 bytes read)
Setting up swapspace version 1, size = 8 GiB (8589930496 bytes)
no label, UUID=...
Qubes: done.
mount: wrong fs type, bad option, bad superblock on /dev/xvda,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.
Waiting for /dev/xvdd device...
mount: /dev/xvdd is write-protected, mounting read-only
[    1.099814] EXT4-fs (xvdd): mounting ext3 file system using the ext4 subsystem
[    1.106796] EXT4-fs (xvdd): mounted filesystem with ordered data mode. Opts: (null)
mount: /sysroot not mounted or bad option

       In some cases useful info is found in syslog - try
       dmesg | tail or so.
[    1.119049] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x1e335a008d5, max_idle_ns: 440795216613 ns
mount: /sysroot not mounted or bad option

       In some cases useful info is found in syslog - try
       dmesg | tail or so.
switch_root: failed to mount moving /sysroot to /: Invalid argument
switch_root: failed. Sorry.
[    1.217841] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100
...

Expected behavior

VMs would start. Firstboot stuff would work. Drives with 4kB sector size would work.

Additional context

I've tracked this down to the handling of the partition table. With 512B sectors, the GPT sits at a different location than with 4kB sectors, so VMs fail to find the correct partition table on xvda. The partition start/end values will obviously also be off by a factor of 8, because the templates are built(?) with an assumption of 512B sector size.

I'm not sure if there are other assumptions based on 512B sectors with the other /dev/xvd* drives.
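
To illustrate why the VM cannot find the table: a GPT header lives at LBA 1, which is byte offset 512 with 512B sectors but byte offset 4096 with 4kB sectors. Below is a minimal Python sketch that probes an image for the `EFI PART` signature at both offsets. The function name `find_gpt_sector_sizes` and the candidate list are illustrative assumptions, not part of any Qubes tool.

```python
GPT_SIGNATURE = b"EFI PART"
CANDIDATE_SECTOR_SIZES = (512, 4096)  # assumption: only these two sizes matter here


def find_gpt_sector_sizes(image):
    """Return every candidate sector size whose LBA 1 starts with a GPT signature.

    `image` is any seekable binary file object (a disk image or block device).
    """
    found = []
    for size in CANDIDATE_SECTOR_SIZES:
        image.seek(size)  # LBA 1 begins at byte offset 1 * sector size
        if image.read(len(GPT_SIGNATURE)) == GPT_SIGNATURE:
            found.append(size)
    return found
```

An image partitioned with 512B sectors has its signature at offset 512 only, so a kernel that sees 4kB sectors and looks at offset 4096 finds nothing there.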

Solutions you've tried

I cloned a template and tried to manually fix the partition table of the clone (in dom0 through /dev/qubes_dom0/...). There was plenty of space before the first partition. At the end, however, the drive is so tight on space that the secondary GPT won't fit, so the tail of the xvda3 partition was truncated slightly, and I didn't try to resize its filesystem first (this probably causes some problems, potentially corruption?). With the partition table fixed this way, I could start VMs (but there are then some other problems/oddities that might be due to the incomplete firstboot or the non-fixed fedora template; I only fixed the debian one, which is the one I mainly use). I could possibly enlarge the relevant LV slightly to avoid the truncation problem at the tail of xvda3, but I've not tried that yet.

I tried to look if I could somehow force pv/vg/lv chain to fake the logical sector size but couldn't find anything from the manpages.

Libvirt might be able to fake the logical_block_size but I've not yet tried that.

Relevant documentation you've consulted

During install, I used the custom install steps to create manual partitioning (but I think it is irrelevant).

Related, non-duplicate issues

None I could find, some other issues included failure to mount root successfully but the causes are different.

Decided solution

Add a partition table conversion to the initramfs. Specifically, write a tool that checks whether the partition table matches the current block size. If it matches, do nothing. If not, convert it to the right block size format before mounting anything, and destroy the wrong partition table (if it isn't directly overwritten by the converted one) to prevent confusion about which one is current.

References:
#4974 (comment)
#4974 (comment)
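
The core arithmetic of such a conversion can be sketched as follows. This is only an illustration in Python, not the tool that was eventually merged, and the function name `convert_partition` is hypothetical. It re-expresses an inclusive partition range in a different sector size and refuses to proceed when the partition boundaries are not aligned to the new size.

```python
def convert_partition(first_lba, last_lba, old_size, new_size):
    """Convert an inclusive [first_lba, last_lba] range between sector sizes.

    Raises ValueError if the partition does not start and end on a
    boundary of the new sector size.
    """
    start_byte = first_lba * old_size
    end_byte = (last_lba + 1) * old_size  # one byte past the end of the partition
    if start_byte % new_size or end_byte % new_size:
        raise ValueError("partition is not aligned to the new sector size")
    return start_byte // new_size, end_byte // new_size - 1
```

For example, a partition spanning sectors 2048 to 4095 on a 512B table maps to sectors 256 to 511 on a 4kB table, and the same function converts it back.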

@ij1 ij1 added P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. T: bug labels Apr 13, 2019
@marmarek
Member

Sector size is advertised by the block backend in xenstore (xenstore-ls /local/domain/0/backend/vbd/$DOMID/51712), but I don't see any option to force specific value.

This issue is really unfortunate, because a lot of places in Qubes assume you can freely transfer a disk image and it will work just fine. This includes cloning VMs (including cloning to a different storage pool), backup/restore etc. So, the solution here should be either:

  • find a way to construct a disk image to work on both 4k and 512 sector size
  • force VM to see 512 sector size

The second one may come with a performance penalty. The first one would not have this problem, but not sure if it's possible. I'm fine with making partition table 4K aligned, as long as it will also work with 512 sector size. But it isn't clear to me it would be enough.

Partition table and filesystem are built here: https://github.com/QubesOS/qubes-linux-template-builder/blob/master/prepare_image#L63-L83

Another idea would be to revert to a filesystem directly on /dev/xvda (without any partition table). This may not be as simple as it sounds, because we need to fit grub somewhere (with HVM with in-VM kernel case).

But this all may not work for other cases, including other OS. Imagine installing some OS (Linux, Windows, whatever) in a standalone HVM and then moving it to another storage pool (or restoring a backup on another machine). Those cases may require emulating constant sector size.

Sadly, I don't have any hardware with 4k physical sector size to test on. I'll try to find a way to emulate one.

BTW, another issue with a 4k sector size is 8GB of swap instead of 1GB. But this should be easy to fix in this script.
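
The factor of 8 follows directly from the sector arithmetic. A small Python sketch, assuming (this is a guess about the script, not something confirmed here) that the swap partition is created with a fixed sector count that was computed for 512B sectors:

```python
GIB = 1024 ** 3


def sectors_for(size_bytes, sector_size):
    """Number of sectors needed to hold size_bytes at the given sector size."""
    return size_bytes // sector_size


# Sector count that yields 1 GiB only when sectors are 512 bytes.
hardcoded_sectors = sectors_for(1 * GIB, 512)

# The same count interpreted on a 4096-byte-sector device is 8 times larger.
size_on_4k = hardcoded_sectors * 4096

# Deriving the count from the actual sector size keeps the size at 1 GiB.
correct_sectors_4k = sectors_for(1 * GIB, 4096)
```

Here `size_on_4k` is exactly 8 GiB, which is consistent with the `mkswap` line in the log above (8 GiB minus one 4096-byte page).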

@marmarek
Member

A lot of useful info: https://superuser.com/questions/679725/how-to-correct-512-byte-sector-mbr-on-a-4096-byte-sector-disk
There is also a script to parse 512-byte GPT on 4k disk (and map it using loop devices).
Using this, one workaround would be to adjust init.sh to rewrite the GPT if a sector size mismatch is detected (in either direction). This requires the partitions to be 4k aligned beforehand, but it should be doable.
But this is far from a complete solution, given non-template-based Linux use cases.

@ij1
Author

ij1 commented Apr 14, 2019

There's not much to worry about regarding 4k alignment, it is already there in the template: from what I gathered, the partition table tools nowadays enforce at least 4k alignment and warn if that would be violated (some might use an even larger alignment). This is why I managed to rewrite the template's partition table so easily in the first place (except for the truncation issue).

I don't think forcing 512 sector size itself would come with a large penalty as in practice the filesystems inside will use something larger than 512 (depending on how all relevant block stuff handles the larger continuous units of course but I'd guess that would not cause performance problems). So it would be mostly relevant for booting up correctly. What I'd rather avoid though, is forcing my drive's firmware to use 512 sector size as it would explore less tested corners of the firmware and possibly have significant performance impact too (I know my NVMe drive could do 512 but I don't know if all 4k drives are able so this needs to be handled anyway).

Btw, the USB HDDs might expose 4kB when not behind the SATA-to-USB converter, perhaps you have one of them which you can disassemble to get such a device? (losetup seems able to fake it as noted below)

Like I said, libvirt supposedly has a way to configure logical_block_size but I don't know if that is able to fake it for real:
https://libvirt.org/formatdomain.html
...or is that only for KVM?

I'll probably try to use the file backend (that's what was used in R3.2 right?) for the main system for now (the NVMe drive should clone fast anyway :-) so the biggest downside I know of is a non-issue). Can the installer do that automatically if I simply reinstall (that is, how it chooses which type of storage pool to use by default) or do I have to manually setup everything afterwards skipping the firstboot stuff to avoid it failing? I can then look into the 4k stuff while other stuff keeps working fine with 512. That would also allow me to easily test cross 512 and 4k copying but that looks rather scary to begin with, so far about nothing seems portable from one sector size to another from what I've read.

If the partition table were removed from xvda, grub might have a similar 4k vs 512 issue anyway, so that might not solve anything (a sector was mentioned somewhere when I tried to look into what kind of information format it uses, which sounds bad), but this needs a deeper investigation.

@ij1
Author

ij1 commented Apr 14, 2019

Losetup seems able to fake logical block size:

util-linux/util-linux@a1a4159

@marmarek
Member

...or is that only for KVM?

Yes, I think it's KVM only.

Can the installer do that automatically if I simply reinstall

If you choose btrfs during installation, Qubes will use that instead of LVM.

@ij1
Author

ij1 commented Apr 15, 2019

Couldn't the faking be done the other way around? My intuition is that in block code log4096 -> phy512 is far simpler than log512 -> phy4096.

Or is there some particular reason why 512 is still needed for the VM disk format that is almost internal to Qubes? VMs will obviously see the end result, but they should have little reason to change how the sector size is defined by the "internal" format. Or is there some other OS that only works with 512?

That would leave just a few things to address:

  • block support for log4096 -> phy512 (or any 2^n to be more generic).
  • Template building code to create 4k GPT (sfdisk doesn't seem to support forcing sector size unlike some other partition tools but losetup could be used to fake it).
  • Fix the 512 assumptions (such as the one with swap partition sizing), hopefully not that many
  • One-time migration (at new release?) which would be at most as complex as a GPT rewrite workaround needed for supporting both sizes, probably somewhat less.

@marmarek
Member

Or is there some particular reason why 512 is still needed for the VM disk format that is almost internal to Qubes

I'm not sure about disks emulated by QEMU. And then windows PV drivers. Recently I've seen some patches flying around fixing 512 sector size assumption somewhere there, so there still may be more issues like this.
Given various elements involved, I think 512 is simply safer in terms of compatibility.

@brendanhoar

brendanhoar commented Jul 8, 2019

Can I throw in another alignment data point to consider: the LVM chunk_size, which can range from 64KB to 1MB.

Policy-wise, Qubes may want to consider ensuring that any physical partitions (or partitions inside lvm LVs), that are created by qubes tools and/or installer, are 1MB aligned, primarily for performance reasons. Probably not as critical as the baseline fixes to ensure 4K logical sector drives work, but since that requires changes, consider enforcing a much more strict alignment going forward (see the volatile volume issue #5151).

Brendan
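
A 1MB alignment policy like the one suggested above is easy to check mechanically. The following Python sketch uses hypothetical helper names (`is_aligned`, `align_up`) and assumes "1MB" means 1 MiB (1048576 bytes):

```python
MIB = 1024 * 1024


def is_aligned(start_lba, sector_size, alignment=MIB):
    """True if the partition's starting byte offset is a multiple of alignment."""
    return (start_lba * sector_size) % alignment == 0


def align_up(start_lba, sector_size, alignment=MIB):
    """Round start_lba up to the next sector that sits on an alignment boundary."""
    sectors_per_unit = alignment // sector_size
    # Ceiling division written with integer arithmetic only.
    return -(-start_lba // sectors_per_unit) * sectors_per_unit
```

A start sector that is 1 MiB aligned is automatically aligned for both 512B and 4kB sectors, so the same layout can be expressed in either sector size.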

@arno01

arno01 commented Jan 19, 2020

If anyone needs 4Kn templates right now, can use my patch from https://gist.github.com/arno01/ae31e1e9098591dadde3d1fc8c707000

I have also found that partprobe will fail to spawn the partitions off loop devices created with the custom sector size (losetup: -b / --sector-size) not corresponding to the sector size of the backing disk on Linux < 4.18-rc4.


And there is some interesting discussion about the 4Kn sector disks.
IIUC, the point Alan Cox makes there is that this kind of problem should be solved at the partitioning level, not at the xenbus / LVM / Linux kernel.

@rustybird

rustybird commented Oct 22, 2022

This will become a bigger problem with R4.2, where cryptsetup >= 2.4.0 (Fedora >= 35) will automatically create dm-crypt with 4K logical sectors even on 512e drives.

Ideas:

  • Invoke cryptsetup luksFormat with an explicit --sector-size=512 argument for the LVM Thin installation layout (fix for 512e drives)

  • Or attach thin volumes to the VM via ~~loop devices~~ dm-ebs (fix for 4Kn and 512e drives)

    • With an optimization to skip the ~~loop device~~ dm-ebs setup (passing through the thin volume) when it already has the right logical block size, and a Volume.logical_block_size property (defaulting to 512), it could be a way to gradually opt into 4K storage volumes in general.

@brendanhoar

I wonder if Qubes pools should specify the sector size of their underlying storage technology, and whether importing volumes should involve a conversion step?

B

@rustybird

rustybird commented Oct 22, 2022

Conversion during import would mean parsing VM data in dom0 😬

Or a DisposableVM I guess.

@rustybird

rustybird commented Oct 22, 2022

Ok someone should definitely write a DisposableVM-powered converter for common volume contents.

But automatic conversion won't be possible in all cases (like standalone HVMs where a volume could contain anything, e.g. bs dependent filesystems like XFS that might not be straightforward to upgrade), so even with a very good converter there's still a need for

  • per-volume metadata recording the appropriate bs for its current content
  • a mechanism to present the volume to the VM with that bs, even if the storage pool's ideal bs is different, e.g. after restoring from a backup

@HW42

HW42 commented Oct 22, 2022

This will become a bigger problem with R4.2, where cryptsetup >= 2.4.0 (Fedora >= 35) will automatically create dm-crypt with 4K logical sectors even on 512e drives.

Interesting. Do you know how they implement this? Because I thought this direction is the tricky one, because a block device should guarantee atomic writes per sector (in other words you should always see either the version before the write or a fully updated sector, but not a mix). So a proper implementation likely needs a journal.

  • a mechanism to present the volume to the VM with that bs, even if the storage pool's ideal bs is different, e.g. after restoring from a backup

At least on the dm level, support seems to exist: https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/dm-ebs.html

@rustybird

cryptsetup >= 2.4.0 (Fedora >= 35) will automatically create dm-crypt with 4K logical sectors even on 512e drives.

Do you know how they implement this? Because I thought this direction is the tricky one, because a block device should guarantee atomic writes per sector

I'm kinda curious too about how writes really work for kernel -> 512e drive communication.

Pure speculation: Since both the kernel and the drive know that the drive's physical block size is 4K, maybe the kernel just always writes batches of 8 * 512B logical blocks - and when the drive sees logical blocks coming in fast enough, one immediately following another, it figures out that read-modify-write can be avoided? Or there could be some explicit way for the kernel to signal to the drive that it's aware of 512e and that it guarantees to send 4K blocks merely encoded as batches of 512B blocks.

https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/dm-ebs.html

Huh. Thanks! Wonder if that's better than a loop device.

@DemiMarie

I can think of at least two solutions:

  1. Place the partition table for a 512e device between the protective MBR and the 4Kn GPT. There are 7 512-byte sectors in this space, which allows for up to 6 partitions. This is enough to fit all three partitions used by Qubes OS, plus one extra covering the 4Kn partition table. The only problem with this approach is that if the block size is 4K, the protective MBR will appear to extend past the end of the device. I suspect this is harmless.
  2. Use a partition table that is not part of the image, but instead is overlaid on it at runtime. This can be done by using dm-linear.

@rustybird

rustybird commented Oct 24, 2022

@DemiMarie I don't get (1). Would there be some script in the VM's initrd to rewrite the partition table ("activating" the stashed away 512B or 4K version) depending on xvda's current logical block size?

Dynamically switching back and forth between 512B and 4K partitioning in general seems like it could make resizing the volume (resize-rootfs-if-needed.sh and resize-rootfs) a little scary...

@marmarek
Member

Generally, I'd try to avoid any kind of conversion at startup and go for emulation when necessary. That means:

  • recording expected block size by the volume
  • emulating it, if the underlying pool has different block size

And then, either build templates with two flavors, or convert at the install time (as part of qvm-template-postprocess), if reasonably easy.
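
The "record and emulate" idea could look roughly like this. It is purely a sketch: the `Volume` class and `needs_emulation` function are hypothetical and do not mirror the real Qubes storage API. The default of 512 is taken from the `Volume.logical_block_size` suggestion earlier in the thread.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    """Hypothetical per-volume metadata; not the actual Qubes Volume class."""
    name: str
    logical_block_size: int = 512  # block size the volume's content expects


def needs_emulation(volume, pool_block_size):
    """Emulate only when the pool's block size differs from what the volume expects."""
    return volume.logical_block_size != pool_block_size
```

With this rule, a 512B template restored onto a 4Kn pool would be attached through an emulation layer, while a volume whose recorded block size already matches the pool is passed through directly.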

Can we get away without emulating 4k bs on 512 bs devices?

@rustybird

Once there's a way to attach a volume as 4K, why even bother building (or converting to) 512B templates.

Can we get away without emulating 4k bs on 512 bs devices?

Forcing --sector-size=4096 for luksFormat in the installer (or reencrypt in the upgrade script) even on drives reporting 512B physical sectors would have the same effect.

I'd guess almost all of those drives (that make sense to install a Qubes storage pool on) actually have 4K physical sectors anyway, but it's misreported by shoddy firmware or an adapter.

@DemiMarie

I currently plan to use the rust-gptman crate for this, as it is packaged in Fedora 41 and actually has things like test suites. It can be backported to older distros by bundling it.

@DemiMarie

DemiMarie commented Nov 5, 2024

Looks like both gptman and gpt have the same bug: neither includes necessary calls to fsync(), and so both risk leaving the system unbootable in the event of power loss!
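
For context, the pattern that is missing looks like the following in Python (the crates in question are Rust; this sketch only illustrates the write-then-fsync idea and the helper name `write_durably` is made up). It uses `os.pwrite` and `os.fsync`, which are available on Unix-like systems.

```python
import os


def write_durably(path, offset, data):
    """Write data at offset and flush it to stable storage before returning."""
    fd = os.open(path, os.O_WRONLY)
    try:
        view = memoryview(data)
        while len(view) > 0:
            written = os.pwrite(fd, view, offset)  # may write fewer bytes than asked
            view = view[written:]
            offset += written
        os.fsync(fd)  # without this, the data may still sit only in the page cache
    finally:
        os.close(fd)
```

Without the `fsync`, a power loss shortly after the write can leave the on-disk partition table in its old or a partially updated state even though the write call itself succeeded.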

@DemiMarie

@marmarek: can the tool that converts from 512 to 4K and back be provided with the previous sector size (perhaps from metadata somewhere), or does it need to guess the previous sector size? The latter could be problematic if there are multiple GPTs on the disk.

@marmarek
Member

marmarek commented Nov 7, 2024

You can get the current sector size (from sysfs?) and check if the current GPT matches. If not, do the conversion.
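
A minimal Python sketch of that check, assuming the sector size is read from the standard `queue/logical_block_size` sysfs attribute. The `sysfs_root` parameter and both function names are illustrative additions so the logic can be exercised without a real device.

```python
from pathlib import Path


def logical_block_size(device, sysfs_root="/sys"):
    """Read the logical block size the kernel reports for a whole-disk device."""
    path = Path(sysfs_root) / "block" / device / "queue" / "logical_block_size"
    return int(path.read_text().strip())


def gpt_needs_conversion(gpt_sector_size, device, sysfs_root="/sys"):
    """True if the GPT was written for a different sector size than the device has."""
    return gpt_sector_size != logical_block_size(device, sysfs_root)
```

For example, `gpt_needs_conversion(512, "xvda")` would report a mismatch when a 512B-formatted template volume is attached to a VM whose xvda reports 4096.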

@ejose19

ejose19 commented Nov 18, 2024

For those coming to this issue, if the device supports going back to 512 bytes then you can apply the commands in #7398 (comment) as a temporary solution until this is fixed (currently using it myself and don't notice any performance issues, so far it has been even faster than 4k and btrfs)

marmarek added a commit to marmarek/anaconda that referenced this issue Nov 21, 2024
LVM (thin) volumes present the original sector size to the VM. Currently
all the qubes templates are built with 512 sector size (that's how
partition table and filesystem are created). When installing Qubes on a
4Kn disk, such VMs won't boot. LUKS2 supports passing down the sector
size, and cryptsetup used in Qubes 4.2 has it enabled by default.

As a temporary workaround, force 512 LUKS sector size on a LVM thin
partitioning layout.

QubesOS/qubes-issues#4974
@marmarek
Member

Note to self: if we ever put something on the EFI partition of the template's root volume, create filesystem there with 4k block size.

DemiMarie added a commit to DemiMarie/qubes-linux-utils that referenced this issue Nov 25, 2024
This adjusts the GPT in the initramfs, and (obviously) requires the
previous commit.

Fixes: QubesOS/qubes-issues#4974
marmarek added a commit to QubesOS/qubes-linux-utils that referenced this issue Nov 26, 2024
* origin/pr/119: (89 commits)
  Fix error handling
  Revert "Try to avoid race conditions"
  Simplify exit code handling
  Revert "Allow searching for a partition and printing its number"
  gptfix: do not declare variables in for_each_used_gpt_entry()
  Better ambiguity errors for multiple matching partitions
  Fix breaking out of for_each_used_gpt_entry()
  Support search by both partition name and type UUID
  Fail if the device appears truncated
  Better error message when backup partition table doesn't fit
  gpt_adjust_sectors(): do not mutate provided GPT
  Iterate only over used GPT entries
  Add accessor functions for individual GPT entries
  Do not check the name of unused partition entries
  Have struct GPT store its sector size
  Allow selecting the compiler in the build environment
  Fix clang warnings
  Only reload the partition table if changes were made
  Allow searching for a partition and printing its number
  Quiet warnings when resizing partitions
  ...

Pull request description:

This has many problems:

- [x] Old partition tables are not cleared before writing the partition table entry that would overwrite them.
- [x] An invalid partition table header is not written before the partition table entries that the valid one would refer to.
- [x] A valid partition table header is written before the entries it refers to, not after.
- [x] CRC32 checksums are ~~not checked when reading (but are created when writing).~~ are now always checked.
- [ ] Lots and lots of debug prints.
- [ ] No tests (it does pass a local test, though).
- [x] No installation (so this file would not even be included in a built package!).
- [x] The fixer is recompiled even when there is no need to rebuild it.
- [ ] The default CFLAGS are meant for local development & debugging.
- [x] `NDEBUG` is forcibly undefined.

Fixes: QubesOS/qubes-issues#4974
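
As background for the CRC32 item in the list above, a GPT header check can be sketched in Python like this. It is not the code from the pull request (which is written in C), and the function name `header_crc_ok` is hypothetical. It relies on the standard GPT header layout: header size at byte offset 12, header CRC32 at byte offset 16, computed over the header with the CRC field zeroed.

```python
import struct
import zlib

GPT_SIGNATURE = b"EFI PART"
MIN_HEADER_SIZE = 92  # size of the fixed GPT header fields


def header_crc_ok(header):
    """Verify the signature and the header CRC32 of a raw GPT header sector."""
    if len(header) < MIN_HEADER_SIZE or header[:8] != GPT_SIGNATURE:
        return False
    (header_size,) = struct.unpack_from("<I", header, 12)
    if header_size < MIN_HEADER_SIZE or header_size > len(header):
        return False
    (stored_crc,) = struct.unpack_from("<I", header, 16)
    # The CRC is computed with its own 4-byte field set to zero.
    zeroed = header[:16] + b"\x00" * 4 + header[20:header_size]
    return zlib.crc32(zeroed) == stored_crc
```

A full check would also validate the partition entry array against its own CRC32 stored at byte offset 88, which this sketch leaves out.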
DemiMarie added a commit to DemiMarie/qubes-linux-utils that referenced this issue Nov 30, 2024
This adjusts the GPT in the initramfs, and (obviously) requires the
previous commit.

Fixes: QubesOS/qubes-issues#4974
DemiMarie added a commit to DemiMarie/qubes-linux-utils that referenced this issue Nov 30, 2024
This adjusts the GPT in the initramfs, and (obviously) requires the
previous commit.

Fixes: QubesOS/qubes-issues#4974
DemiMarie added a commit to DemiMarie/qubes-linux-utils that referenced this issue Dec 3, 2024
This adjusts the GPT in the initramfs, and (obviously) requires the
previous commit.

Fixes: QubesOS/qubes-issues#4974
DemiMarie added a commit to DemiMarie/qubes-linux-utils that referenced this issue Dec 6, 2024
This adjusts the GPT in the initramfs, and (obviously) requires the
previous commit.

Fixes: QubesOS/qubes-issues#4974
DemiMarie added a commit to DemiMarie/qubes-linux-utils that referenced this issue Dec 8, 2024
This adjusts the GPT in the initramfs, and (obviously) requires the
previous commit.

Fixes: QubesOS/qubes-issues#4974
@github-project-automation github-project-automation bot moved this from In progress to Done in Current team tasks Dec 8, 2024
marmarek added a commit to marmarek/qubes-builderv2 that referenced this issue Dec 8, 2024
Bootloader isn't installed there yet, but having the filesystem allows
systemd to mount it (it really wants to, even if not booting in EFI
mode...) instead of failing.
And also, set 4k sector size to avoid compatibility issues.

QubesOS/qubes-issues#4974
@marmarek
Member

It's green! https://openqa.qubes-os.org/tests/121681
