Support 4k storage #4974
Comments
Sector size is advertised by the block backend in xenstore ( This issue is really unfortunate, because a lot of places in Qubes assume you can freely transfer a disk image and it will work just fine. This includes cloning VMs (including cloning to a different storage pool), backup/restore etc. So, the solution here should be either:
The second one may come with a performance penalty. The first one would not have this problem, but I'm not sure if it's possible.
I'm fine with making the partition table 4K aligned, as long as it also works with 512-byte sectors. But it isn't clear to me that this would be enough. The partition table and filesystem are built here: https://github.com/QubesOS/qubes-linux-template-builder/blob/master/prepare_image#L63-L83
Another idea would be to revert to a filesystem directly on /dev/xvda (without any partition table). This may not be as simple as it sounds, because we need to fit grub somewhere (for the HVM with in-VM kernel case).
But all of this may not work for other cases, including other OSes. Imagine installing some OS (Linux, Windows, whatever) in a standalone HVM and then moving it to another storage pool (or restoring a backup on another machine). Those cases may require emulating a constant sector size.
Sadly, I don't have any hardware with 4k physical sector size to test on. I'll try to find a way to emulate one.
BTW, another issue with 4k sector size is 8GB of swap instead of 1GB. But this should be easy to fix in this script |
A lot of useful info: https://superuser.com/questions/679725/how-to-correct-512-byte-sector-mbr-on-a-4096-byte-sector-disk |
There's not much to worry about regarding 4k alignment; it is already there in the template:
From what I gathered, partition table tools nowadays enforce at least 4k alignment and warn if it would be violated (some might use an even larger alignment). This is why I managed to rewrite the template's partition table so easily in the first place (except for the truncation issue).
I don't think forcing a 512 sector size would itself come with a large penalty, since in practice the filesystems inside will use something larger than 512 (this depends on how the relevant block layers handle larger contiguous units, of course, but I'd guess it would not cause performance problems). So it would mostly matter for booting up correctly. What I'd rather avoid, though, is forcing my drive's firmware to use a 512 sector size, as that would exercise less tested corners of the firmware and possibly have a significant performance impact too (I know my NVMe drive can do 512, but I don't know whether all 4k drives can, so this needs to be handled anyway).
Like I said, libvirt supposedly has a way to configure
I'll probably try to use the file backend (that's what was used in R3.2, right?) for the main system for now (the NVMe drive should clone fast anyway :-) so the biggest downside I know of is a non-issue). Can the installer do that automatically if I simply reinstall (that is, how does it choose which type of storage pool to use by default), or do I have to set everything up manually afterwards, skipping the firstboot stuff to avoid it failing? I can then look into the 4k stuff while everything else keeps working fine with 512. That would also allow me to easily test copying between 512 and 4k, but that looks rather scary to begin with; so far almost nothing seems portable from one sector size to another from what I've read.
If the partition table were removed from xvda, grub might have a similar 4k vs 512 issue anyway, so that might not solve anything (sectors were mentioned somewhere when I tried to look into what kind of information format it uses, which sounds bad), but this needs a deeper investigation. |
Losetup seems able to fake logical block size: |
Yes, I think it's KVM only.
If you choose btrfs during installation, Qubes will use that instead of LVM. |
Could not the faking be done the other way around? My intuition is that in block code log4096 -> phy512 is far simpler than log512 -> phy4096. Or is there some particular reason why 512 is still needed for the VM disk format, which is almost internal to Qubes? VMs will obviously see the end result, but they should have little reason to change how the sector size is defined by the "internal" format. Or is there some other OS that only works with 512? That would leave just a few things to address:
|
I'm not sure about disks emulated by QEMU. And then windows PV drivers. Recently I've seen some patches flying around fixing 512 sector size assumption somewhere there, so there still may be more issues like this. |
Can I throw in another alignment data point to consider: the LVM chunk_size, which can range from 64KB to 1MB. Policy-wise, Qubes may want to consider ensuring that any physical partitions (or partitions inside lvm LVs), that are created by qubes tools and/or installer, are 1MB aligned, primarily for performance reasons. Probably not as critical as the baseline fixes to ensure 4K logical sector drives work, but since that requires changes, consider enforcing a much more strict alignment going forward (see the volatile volume issue #5151). Brendan |
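To make the 1MB alignment suggestion concrete, here is a minimal sketch of how such a check could look for both sector sizes. The function names and the sample start LBA of 34 are hypothetical and only serve as an illustration; this is not code from any Qubes tool.

```python
MIB = 1024 * 1024  # alignment target suggested above (largest LVM chunk_size)


def is_aligned(start_lba: int, sector_size: int, alignment: int = MIB) -> bool:
    """True if the partition's starting byte offset is a multiple of `alignment`."""
    return (start_lba * sector_size) % alignment == 0


def align_up(start_lba: int, sector_size: int, alignment: int = MIB) -> int:
    """Round a starting LBA up to the next aligned LBA for this sector size."""
    sectors_per_unit = alignment // sector_size
    return -(-start_lba // sectors_per_unit) * sectors_per_unit  # ceiling division


if __name__ == "__main__":
    # The same 1 MiB byte offset is LBA 2048 with 512-byte sectors
    # and LBA 256 with 4096-byte sectors.
    print(align_up(34, 512), align_up(34, 4096))
```

A start that is 1 MiB aligned is also 4K aligned, so one policy would cover both 512 and 4K logical sector sizes.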
If anyone needs 4Kn templates right now, they can use my patch from https://gist.github.com/arno01/ae31e1e9098591dadde3d1fc8c707000 I have also found that And there is some interesting discussion about the 4Kn sector disks. |
This will become a bigger problem with R4.2, where cryptsetup >= 2.4.0 (Fedora >= 35) will automatically create dm-crypt with 4K logical sectors even on 512e drives. Ideas:
|
I wonder if Qubes pools should specify the sector size of their underlying storage technology, and whether importing volumes should involve a conversion step? B |
Conversion during import would mean parsing VM data in dom0 😬 Or a DisposableVM I guess. |
Ok someone should definitely write a DisposableVM-powered converter for common volume contents. But automatic conversion won't be possible in all cases (like standalone HVMs where a volume could contain anything, e.g. bs dependent filesystems like XFS that might not be straightforward to upgrade), so even with a very good converter there's still a need for
|
Interesting. Do you know how they implement this? Because I thought this direction is the tricky one, because a block device should guarantee atomic writes per sector (in other words you should always see either the version before the write or a fully updated sector, but not a mix). So a proper implementation likely needs a journal.
At least at the dm level, support seems to exist: https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/dm-ebs.html |
I'm kinda curious too about how writes really work for kernel -> 512e drive communication. Pure speculation: Since both the kernel and the drive know that the drive's physical block size is 4K, maybe the kernel just always writes batches of 8 * 512B logical blocks - and when the drive sees logical blocks coming in fast enough, one immediately following another, it figures out that read-modify-write can be avoided? Or there could be some explicit way for the kernel to signal to the drive that it's aware of 512e and that it guarantees to send 4K blocks merely encoded as batches of 512B blocks.
Huh. Thanks! Wonder if that's better than a loop device. |
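The read-modify-write concern discussed above can be illustrated with a toy model. This is only a sketch under the assumption that the backing store can transfer nothing smaller than a whole 4096-byte sector; the class `Emulated512` is hypothetical and does not describe how dm-ebs, loop devices, or drive firmware actually work.

```python
PHYS = 4096     # sector size of the backing store
LOGICAL = 512   # sector size presented to the guest


class Emulated512:
    """Toy 512-byte view of a store that only transfers whole 4096-byte sectors."""

    def __init__(self, num_phys_sectors: int):
        self.store = bytearray(num_phys_sectors * PHYS)
        self.phys_reads = 0
        self.phys_writes = 0

    def _read_phys(self, index: int) -> bytearray:
        self.phys_reads += 1
        return bytearray(self.store[index * PHYS:(index + 1) * PHYS])

    def _write_phys(self, index: int, data: bytes) -> None:
        assert len(data) == PHYS
        self.phys_writes += 1
        self.store[index * PHYS:(index + 1) * PHYS] = data

    def write_logical(self, lba: int, data: bytes) -> None:
        assert len(data) == LOGICAL
        index, slot = divmod(lba * LOGICAL, PHYS)
        sector = self._read_phys(index)       # read the whole physical sector
        sector[slot:slot + LOGICAL] = data    # modify one 512-byte slice
        self._write_phys(index, sector)       # write all 4096 bytes back

    def read_logical(self, lba: int) -> bytes:
        index, slot = divmod(lba * LOGICAL, PHYS)
        return bytes(self._read_phys(index)[slot:slot + LOGICAL])
```

Every 512-byte write rewrites seven neighbouring logical sectors as well. If the write-back step is interrupted, those neighbours are at risk even though the guest never touched them, which is the reason a journal was suggested above.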
I can think of at least two solutions:
|
@DemiMarie I don't get (1). Would there be some script in the VM's initrd to rewrite the partition table ("activating" the stashed away 512B or 4K version) depending on xvda's current logical block size? Dynamically switching back and forth between 512B and 4K partitioning in general seems like it could make resizing the volume ( |
Generally, I'd try to avoid any kind of conversion at startup and go for emulation when necessary. That means:
And then, either build templates with two flavors, or convert at the install time (as part of Can we get away without emulating 4k bs on 512 bs devices? |
Once there's a way to attach a volume as 4K, why even bother building (or converting to) 512B templates.
Forcing I'd guess almost all of those drives (that make sense to install a Qubes storage pool on) actually have 4K physical sectors anyway, but it's misreported by shoddy firmware or an adapter. |
I currently plan to use the rust-gptman crate for this, as it is packaged in Fedora 41 and actually has things like test suites. It can be backported to older distros by bundling it. |
@marmarek: can the tool that converts from 512 to 4K and back be provided with the previous sector size (perhaps from metadata somewhere), or does it need to guess the previous sector size? The latter could be problematic if there are multiple GPTs on the disk. |
You can get the current sector size (from sysfs?) and check if the current GPT matches. If not, do the conversion. |
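As a rough sketch of the sysfs lookup mentioned above: on Linux the kernel exposes the logical block size of a whole-disk device under `/sys/block/<device>/queue/logical_block_size`. The function name and the `sysfs_root` parameter are assumptions added here so the lookup can be exercised against a fake directory tree.

```python
from pathlib import Path


def logical_block_size(device: str, sysfs_root: str = "/sys") -> int:
    """Read the logical block size the kernel reports for a whole-disk device.

    `device` is a name such as "xvda" (not a partition like "xvda3").
    """
    path = Path(sysfs_root) / "block" / device / "queue" / "logical_block_size"
    return int(path.read_text().strip())


if __name__ == "__main__":
    # Inside a Qubes VM the root volume would typically be "xvda".
    print(logical_block_size("xvda"))
```

The value read this way would then be compared against the sector size the on-disk GPT was written for.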
For those coming to this issue, if the device supports going back to 512 bytes then you can apply the commands in #7398 (comment) as a temporary solution until this is fixed (currently using it myself and don't notice any performance issues, so far it has been even faster than 4k and btrfs) |
LVM (thin) volumes present the original sector size to the VM. Currently all the qubes templates are built with 512 sector size (that's how partition table and filesystem are created). When installing Qubes on a 4Kn disk, such VMs won't boot. LUKS2 supports passing down the sector size, and cryptsetup used in Qubes 4.2 has it enabled by default. As a temporary workaround, force 512 LUKS sector size on a LVM thin partitioning layout. QubesOS/qubes-issues#4974
Note to self: if we ever put something on the EFI partition of the template's root volume, create filesystem there with 4k block size. |
This adjusts the GPT in the initramfs, and (obviously) requires the previous commit. Fixes: QubesOS/qubes-issues#4974
* origin/pr/119: (89 commits)
  Fix error handling
  Revert "Try to avoid race conditions"
  Simplify exit code handling
  Revert "Allow searching for a partition and printing its number"
  gptfix: do not declare variables in for_each_used_gpt_entry()
  Better ambiguity errors for multiple matching partitions
  Fix breaking out of for_each_used_gpt_entry()
  Support search by both partition name and type UUID
  Fail if the device appears truncated
  Better error message when backup partition table doesn't fit
  gpt_adjust_sectors(): do not mutate provided GPT
  Iterate only over used GPT entries
  Add accessor functions for individual GPT entries
  Do not check the name of unused partition entries
  Have struct GPT store its sector size
  Allow selecting the compiler in the build environment
  Fix clang warnings
  Only reload the partition table if changes were made
  Allow searching for a partition and printing its number
  Quiet warnings when resizing partitions
  ...
Pull request description:
This has many problems:
- [x] Old partition tables are not cleared before writing the partition table entry that would overwrite them.
- [x] An invalid partition table header is not written before the partition table entries that the valid one would refer to.
- [x] A valid partition table header is written before the entries it refers to, not after.
- [x] CRC32 checksums are ~~not checked when reading (but are created when writing).~~ are now always checked.
- [ ] Lots and lots of debug prints.
- [ ] No tests (it does pass a local test, though).
- [x] No installation (so this file would not even be included in a built package!).
- [x] The fixer is recompiled even when there is no need to rebuild it.
- [ ] The default CFLAGS are meant for local development & debugging.
- [x] `NDEBUG` is forcibly undefined.
Fixes: QubesOS/qubes-issues#4974
Bootloader isn't installed there yet, but having the filesystem allows systemd to mount it (it really wants to, even if not booting in EFI mode...) instead of failing. And also, set 4k sector size to avoid compatibility issues. QubesOS/qubes-issues#4974
It's green! https://openqa.qubes-os.org/tests/121681 |
Qubes OS version
R4.0
Affected component(s) or functionality
VMs not working/starting right from a fresh install.
Brief summary
Right after a fresh install, all VMs fail to mount root and therefore fail to start beyond the point where they expect /dev/xvda3 to be available. This happens on a device that has 4kB logical and physical block sizes (NVMe drive). This was not a problem in R3.2 (as it used files by default for VM storage).
To Reproduce
Steps to reproduce the behavior:
Expected behavior
VMs would start. Firstboot stuff would work. Drives with 4kB sector size would work.
Additional context
I've tracked this down to the handling of the partition table. With 512B sectors the location of the GPT differs from its location with 4kB sectors, and therefore VMs fail to find the correct partition table on xvda. The partition start/end values will obviously also be off by a factor of 8, because the templates are built(?) with an assumption of 512B sector size.
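The arithmetic behind this can be sketched as follows. The primary GPT header lives at LBA 1, so its byte offset depends on the logical sector size. The function names and the sample start LBA of 2048 are hypothetical, chosen only for illustration.

```python
GPT_HEADER_LBA = 1  # the primary GPT header always sits at LBA 1


def gpt_header_offset(sector_size: int) -> int:
    """Byte offset of the primary GPT header for a given logical sector size."""
    return GPT_HEADER_LBA * sector_size


def rescale_lba(lba: int, old_size: int, new_size: int) -> int:
    """Convert a starting LBA between sector sizes, keeping the byte offset fixed."""
    byte_offset = lba * old_size
    if byte_offset % new_size:
        raise ValueError(f"byte offset {byte_offset} is not aligned to {new_size}")
    return byte_offset // new_size


if __name__ == "__main__":
    print(gpt_header_offset(512), gpt_header_offset(4096))  # 512 4096
    start_512 = 2048  # hypothetical partition start: 1 MiB in 512-byte sectors
    print(rescale_lba(start_512, 512, 4096))                # 256
    # Reading the unconverted value as 4096-byte sectors lands 8x too far:
    print(start_512 * 4096 // (start_512 * 512))            # 8
```

Note that GPT end LBAs are inclusive, so a real converter would need to rescale the end of a partition slightly differently from its start.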
I'm not sure if there are other assumptions based on 512B sectors with the other /dev/xvd* drives.
Solutions you've tried
I cloned a template and tried to manually fix the partition table of the clone (in dom0 through /dev/qubes_dom0/...). There was plenty of space before the first partition; however, at the end the drive is so tight on space that the GPT secondary table won't fit, so the xvda3 partition's tail was truncated slightly, and I didn't try to resize its filesystem first (this probably causes some problems, potentially corruption?). With such a fixed partition table, I could start VMs (but there are then some other problems/oddities that might be due to incomplete firstboot or the non-fixed fedora template; I only fixed the debian one, which I mainly use). I could possibly enlarge the relevant LV slightly to avoid the truncation problem at the tail of xvda3, but I've not tried that yet.
I tried to look if I could somehow force pv/vg/lv chain to fake the logical sector size but couldn't find anything from the manpages.
Libvirt might be able to fake the logical_block_size but I've not yet tried that.
Relevant documentation you've consulted
During install, I used the custom install steps to create manual partitioning (but I think it is irrelevant).
Related, non-duplicate issues
None I could find, some other issues included failure to mount root successfully but the causes are different.
Decided solution
Add a partition table conversion to the initramfs. Specifically, write a tool that checks whether the partition table matches the current block size. If it matches, do nothing. If not, convert it to the right block size format before mounting anything. Also destroy the stale partition table (if it isn't directly overwritten by the converted one) to avoid confusion about which one is current.
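A minimal sketch of the "does the partition table match the current block size" check, assuming it is enough to look for the 8-byte GPT signature `EFI PART` at LBA 1 for each candidate sector size. This is a simplification and not the actual initramfs tool: a real implementation would also validate the header CRC32 and the backup header. The helper `make_image` is hypothetical and only builds an in-memory test image.

```python
import io

GPT_SIGNATURE = b"EFI PART"   # 8-byte signature at the start of a GPT header
CANDIDATE_SIZES = (512, 4096)


def detect_gpt_sector_sizes(dev) -> list[int]:
    """Return every candidate sector size whose LBA 1 holds a GPT signature."""
    found = []
    for size in CANDIDATE_SIZES:
        dev.seek(size)  # LBA 1 for this sector size
        if dev.read(len(GPT_SIGNATURE)) == GPT_SIGNATURE:
            found.append(size)
    return found


def needs_conversion(dev, current_size: int) -> bool:
    """True if no GPT header is found where the current sector size expects it."""
    found = detect_gpt_sector_sizes(dev)
    if not found:
        raise ValueError("no GPT signature at LBA 1 for 512 or 4096-byte sectors")
    return current_size not in found


def make_image(header_offset: int) -> io.BytesIO:
    """Hypothetical helper: an empty image with only a GPT signature in it."""
    return io.BytesIO(bytes(header_offset) + GPT_SIGNATURE + bytes(8192))


if __name__ == "__main__":
    image = make_image(512)             # table written for 512-byte sectors
    print(needs_conversion(image, 4096))  # True
```

If signatures turn up at both offsets, the disk carries two tables, which is the ambiguous case raised earlier in the thread and the reason the stale table should be destroyed after conversion.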
References:
#4974 (comment)
#4974 (comment)