Conversation

@changlei-li
Contributor

  • Resolve the conflict in the newly added host fields.
  • Add one commit to update the lifecycle.

edwintorok and others added 30 commits November 27, 2025 10:28
…ode set

Could also compute it by multiplying it with [threads_per_core],
but I'm not sure how that'd interact with [smt=false] in Xen.
Also to future-proof this I wouldn't want to rely on an entirely
symmetrical architecture
(although it'd be very rare to have anything other than 2 on x86-64,
 or to have hyperthreading on in one socket, and off in another).

Note that core ids are not unique (there is a core `0` on both socket 0 and
socket 1 for example), so only work with number of cores in the topology code.
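
As a minimal illustration of why raw core ids must not be compared across sockets (the record type and names here are hypothetical, not the actual topology code), counting distinct (socket, core) pairs gives the number of physical cores:

```ocaml
(* hypothetical CPU description: core ids repeat across sockets, so a
   physical core is only identified by the (socket, core) pair *)
type cpu = { socket : int; core : int; thread : int }

(* number of distinct physical cores among a list of CPUs *)
let nr_cores cpus =
  cpus
  |> List.map (fun c -> (c.socket, c.core))
  |> List.sort_uniq compare
  |> List.length

let () =
  (* core 0 exists on both sockets, yet these are two different cores *)
  let cpus =
    [ {socket= 0; core= 0; thread= 0}; {socket= 0; core= 0; thread= 1}
    ; {socket= 1; core= 0; thread= 0}; {socket= 1; core= 0; thread= 1} ]
  in
  assert (nr_cores cpus = 2)
```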

Could've created a CoreSocketSet instead (assuming that no higher grouping than
sockets would exist in the future), but for now don't make too many assumptions
about topology.

No functional change.

Signed-off-by: Edwin Török <[email protected]>
The planner explicitly looks at the NUMARequest fields and checks that they are
non-zero.
However if more fields get added in the future this leads to an assertion
failure, where the planner thinks it has found a solution, but NUMARequest.fits
returns false.

Ensure consistency: use `fits` in the planner to check that we've reached a
solution. If the remaining request doesn't fit into an empty node, then the
request is not empty.
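
A minimal sketch of that consistency argument, with made-up record fields rather than the real NUMARequest type: as long as the planner's "done" check is expressed through `fits`, adding a new field to the request cannot make the two checks disagree.

```ocaml
(* hypothetical request/availability record, not xenopsd's NUMARequest *)
type request = { memory : int64; vcpus : int; cores : int }

let fits request available =
  request.memory <= available.memory
  && request.vcpus <= available.vcpus
  && request.cores <= available.cores

let empty = { memory = 0L; vcpus = 0; cores = 0 }

(* the planner's "done" check: a remaining request is satisfied exactly
   when it fits into a node with nothing free *)
let is_satisfied remaining = fits remaining empty
```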

Signed-off-by: Edwin Török <[email protected]>
The requested number of cores is still 0, so no functional change.

Signed-off-by: Edwin Török <[email protected]>
…io_mem_only

The current NUMA policy prioritizes reducing cross-NUMA node memory traffic by
picking the smallest set of NUMA nodes that fit a VM.
It doesn't look at how this affects CPU overload within a NUMA node, or whether
the local bandwidth of each NUMA node is balanced or not.

Give this policy an explicit name, `Prio_mem_only`, and when the "compat" setting
in `xenopsd.conf` is used (`numa-placement=true`), then explicitly use this
policy instead of Best-effort.

Currently Best-effort is still equivalent to this policy, but that'll change in
a follow-up commit.
Introduce a new xenopsd.conf entry `numa-best-effort-prio-mem-only`,
which can be used to explicitly revert best effort to the current policy.
(currently this is a no-op, because there is only one best-effort policy).
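
For illustration, this is roughly what the relevant xenopsd.conf entries could look like; only the key names come from this change, and the boolean value for the new key is an assumption:

```
# compat setting: use the old memory-only policy (now named Prio_mem_only)
numa-placement=true
# assumed boolean: explicitly keep best-effort on the memory-only policy
numa-best-effort-prio-mem-only=true
```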

Future policies should also look at CPU overload.

No functional change.

Signed-off-by: Edwin Török <[email protected]>
NUMA optimized placement can have a large performance hit on machines with
small NUMA nodes and VMs with a large number of vCPUs.
For example, consider a machine that has 2 sockets and can run at most 32 vCPUs
in a single socket (NUMA node), and a VM with 32 vCPUs.

Usually Xen would try to spread the load across actual cores, and avoid the
hyperthread siblings, e.g. using CPUs 0,2,4,etc.
But when NUMA placement is used all the vCPUs must be in the same NUMA node.
If that NUMA node doesn't have enough cores, then Xen will have no choice but
to use CPUs 0,1,2,3,etc.

Hyperthread siblings share resources, and if you try to use both at the same
time you get a big performance hit, depending on the workload.

Avoid this by "requesting" cores=vcpus for each VM,
which will make the placement algorithm choose the next size up in terms of
NUMA nodes (i.e. instead of 1 NUMA node, use 2,3 as needed, falling back to using
all nodes if needed).

The potential gain from reducing memory latency with a NUMA optimized placement
(~20% on Intel Memory Latency Checker: Idle latency) is outweighed by
the potential loss due to reduced CPU capacity (40%-75% on OpenSSL, POV-Ray, and
OpenVINO), so this is the correct trade-off.

If the NUMA node is large enough, or if the VMs have a small number of vCPUs
then we still try to use a single NUMA node as we did previously.

The performance difference can be reproduced and verified easily by running
`openssl speed -multi 32 rsa4096` on a 32 vCPU VM on a host that has 2 NUMA
nodes, with 32 PCPUs each, and 2 threads per core.
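
A hedged sketch of the core-count check described above (function names are illustrative, not xenopsd's), using the numbers from this example host:

```ocaml
let cores_of_node ~pcpus ~threads_per_core = pcpus / threads_per_core

(* a node qualifies only if it has at least as many physical cores as the
   VM has vCPUs *)
let node_has_enough_cores ~vcpus ~pcpus ~threads_per_core =
  vcpus <= cores_of_node ~pcpus ~threads_per_core

let () =
  (* the host above: 2 NUMA nodes, 32 PCPUs each, 2 threads per core *)
  assert (not (node_has_enough_cores ~vcpus:32 ~pcpus:32 ~threads_per_core:2));
  (* a 16-vCPU VM still fits on a single node *)
  assert (node_has_enough_cores ~vcpus:16 ~pcpus:32 ~threads_per_core:2)
```
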
This introduces a policy that can control whether we want to filter out
NUMA nodes with too few cores.

Although we want to enable this filter by default, we still want
an "escape hatch" to turn it off if we find problems with it.
That is why the "compat" setting (numa-placement=true) in xenopsd.conf
reverts back to the old policy, which is now explicitly named Prio_mem_only.

There could still be workloads where optimizing for memory bandwidth makes more
sense (although that is a property of the NUMA node, not of individual VMs),
so although it might be desirable for this to be a per-VM policy, it cannot be
one, because it affects other VMs too.

TODO: when sched-gran=core this should be turned off. That always has the
performance hit, so might as well use smaller NUMA nodes if available.

For now this isn't exposed yet as a XAPI-level policy, because that requires
more changes (to also sort by free cores on a node, and to also sort at the
pool level by free cpus on a host).
Once we have those changes we can introduce a new policy `prio_core_mem`
that sorts by free cores first, then by free memory, and requires cores>=vcpus
(i.e. cpus>=vcpus*threads_per_core) when choosing a node.

This changes the default to the new setting, which should be equal to or an
improvement on the previous behaviour in the general case.
An "escape hatch" to revert to the previous behaviour is to set
`numa-placement=true` in xenopsd.conf, and the XAPI host-level policy to
'default_policy'.

Signed-off-by: Edwin Török <[email protected]>
This partially applies the following commit to reduce the complexity of a
Xen-4.20 patch:

> xenopsd-xc: do not try keep track of free memory when planning NUMA nodes (CA-411684)
>
> Free memory is now properly accounted for because the memory pages are claimed
> within the NUMA mutex, so there's no need to have double tracking.
>
> On top of that, this code never increased the free memory, which means that it
> always reached a point where it was impossible to allocate a domain into a
> single numa node.
> Signed-off-by: Pau Ruiz Safont <[email protected]>

However it doesn't actually drop the free memory accounting code, so:
No functional change

Signed-off-by: Edwin Török <[email protected]>
The behaviour is the same, except that the file will remain (with empty
rules) even if the forwarding log has been disabled (which prevents the
disclaimer from being lost).

The wording is slightly modified to reduce the risk of concurrent management
of this file: if the file is still edited by other means, it will be done
knowingly (x-s-r may override any change sooner or later).

For the record, remote.conf was introduced in:
468eb75 .

I am assuming that the presence of the file is not checked anywhere other
than in xen-api (currently only s/x-s-r references this file).

Signed-off-by: Philippe Coval <[email protected]>
The methods were not safe and thankfully unused as well.

Signed-off-by: Pau Ruiz Safont <[email protected]>
During the bringup of a new platform for XCP-ng, one of the users hit an
exception because the database doesn't have any pool record.

This change makes sure that there's more information on the exception.

Also changes other places where Pool.get_all is used, to make it obvious that
the result is matched and that List.hd is less likely to be used.
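
A small sketch of the pattern being introduced (the helper name is hypothetical): matching on the result instead of calling List.hd turns an empty database into a `Failure` that names the caller.

```ocaml
(* hypothetical helper: [caller] is the name of the function doing the lookup *)
let get_unique_pool ~caller = function
  | [pool] -> pool
  | [] -> failwith (caller ^ ": no pool record in the database")
  | _ -> failwith (caller ^ ": more than one pool record in the database")
```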

Signed-off-by: Pau Ruiz Safont <[email protected]>
In particular, a couple of unused modules using it have been removed, and
uses of List.hd on Pool.get_all have been replaced with `Failure`s carrying
the name of the function raising the exception. The latter can be hit in some
exceptional cases, as some platform engineers found when preparing and
testing xcp-ng 9.
…#6774)

The behaviour is the same, except that the file will remain (with empty
rules) even if the forwarding log has been disabled (which prevents the
disclaimer from being lost).

The wording is slightly modified to reduce the risk of concurrent management
of this file: if the file is still edited by other means, it will be done
knowingly (x-s-r may override any change sooner or later).

For the record, remote.conf was introduced in:
468eb75 .

I am assuming that the presence of the file is not checked anywhere other
than in xen-api (currently only s/x-s-r references this file).
The xenstore watcher maintains a map from domid to VM UUID. This map is
used to dispatch the xenstore events. When the VM is renamed, its UUID
changes, hence this map needs to be refreshed. Otherwise, the xenstore events
cannot be dispatched to the renamed VM.
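
A minimal sketch of the idea, with hypothetical names rather than the actual Watcher module: the domid -> UUID entry has to be replaced when the VM's UUID changes, otherwise dispatch keeps resolving to the stale UUID.

```ocaml
(* hypothetical watcher state: domid -> VM UUID *)
let domid_to_uuid : (int, string) Hashtbl.t = Hashtbl.create 16

let on_domain_created ~domid ~uuid = Hashtbl.replace domid_to_uuid domid uuid

(* without this refresh, events on the domain's xenstore paths would still be
   attributed to the old UUID after the rename *)
let on_vm_renamed ~domid ~new_uuid = Hashtbl.replace domid_to_uuid domid new_uuid

let dispatch ~domid handle_event =
  match Hashtbl.find_opt domid_to_uuid domid with
  | Some uuid -> handle_event uuid
  | None -> () (* not a domain we are watching *)
```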

Signed-off-by: Ming Lu <[email protected]>
…med (xapi-project#6773)

(Two commits are included in this PR. The first one is **only to move
the position of existing code within the same file** to allow the use of
module Watcher in module VM in the second one.)

The xenstore watcher maintains a map from domid to VM UUID. This map is
used to dispatch the xenstore events. When the VM is renamed, its UUID
changes, hence this map needs to be refreshed. Otherwise, the xenstore events
cannot be dispatched to the renamed VM.
- Introduce https_only argument for Host.create
- Set https_only from configuration for installation
- Keep the https_only setting of the joining host during pool join

Signed-off-by: Lin Liu <[email protected]>
- Introduce https_only argument for Host.create
- Set https_only from configuration for installation
- Keep the https_only setting of the joining host during pool join
…ect#6763)

NUMA optimized placement can have a large performance hit on machines
with small NUMA nodes and VMs with a large number of vCPUs. For example,
consider a machine that has 2 sockets and can run at most 32 vCPUs in a single
socket (NUMA node), and a VM with 32 vCPUs.

Usually Xen would try to spread the load across actual cores, and avoid
the hyperthread siblings (when the machine is sufficiently idle, or the
workload is bursty), e.g. using CPUs 0,2,4,etc.
But when NUMA placement is used all the vCPUs must be in the same NUMA
node. If that NUMA node doesn't have enough cores, then Xen will have no
choice but to use CPUs 0,1,2,3,etc.

Hyperthread siblings share resources, and if you try to use both at the
same time you get a big performance hit, depending on the workload.
We've also seen this previously with Xen's core-scheduling support
(which is off by default).

Avoid this by "requesting" `threads_per_core` times more vCPUs for each
VM, which will make the placement algorithm choose the next size up in
terms of NUMA nodes (i.e. instead of a single NUMA node use 2,3 as
needed, falling back to using all nodes if needed).

The potential gain from reducing memory latency with a NUMA optimized
placement (~20% on Intel Memory Latency Checker: Idle latency) is
outweighed by the potential loss due to reduced CPU capacity (40%-75% on
OpenSSL, POV-Ray, and OpenVINO), so this is the correct tradeoff.

If the NUMA node is large enough, or if the VMs have a small number of
vCPUs then we still try to use a single NUMA node as we did previously.

The performance difference can be reproduced and verified easily by
running `openssl speed -multi 32 rsa4096` on a 32 vCPU VM on a host that
has 2 NUMA nodes, with 32 PCPUs each, and 2 threads per core.
This partially applies the following commit to reduce the complexity of
a Xen-4.20 patch:

> xenopsd-xc: do not try keep track of free memory when planning NUMA nodes (CA-411684)
>
> Free memory is now properly accounted for because the memory pages are claimed
> within the NUMA mutex, so there's no need to have double tracking.
>
> On top of that, this code never increased the free memory, which means that it
> always reached a point where it was impossible to allocate a domain into a
> single numa node.
> Signed-off-by: Pau Ruiz Safont <[email protected]>

However it doesn't actually drop the free memory accounting code, so: No
functional change
Define and implement an operation that uses Xen's fast resume to resume a
domain. This operation is currently not used but has been tested. It is
accessible from the xenopsd CLI ("xenops-cli") for experiments.

Signed-off-by: Christian Lindig <[email protected]>
The current default value for the NVMe MDTS parameter exposed in QEMU emulated
NVMe devices is 7 (max 512KiB requests). However there seems to be an
internal Windows Server 2025 issue that possibly triggers when splitting
bigger requests into smaller ones in the NVMe Windows driver.

Increase the exposed MDTS value on the emulated QEMU NVMe device to 9 (max 2MiB
request size), as that seems to drop the reproduction rate of the issue.
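
For reference, MDTS is a power-of-two multiplier of the controller's minimum memory page size (CAP.MPSMIN); assuming the common 4 KiB page size, the two values in this commit work out as follows (quick OCaml sketch):

```ocaml
(* max transfer size = page_size * 2^MDTS *)
let max_transfer_bytes ~mdts ~page_size = page_size * (1 lsl mdts)

let () =
  let page_size = 4 * 1024 (* assumed MPSMIN of 4 KiB *) in
  (* MDTS=7 -> 512 KiB *)
  Printf.printf "MDTS=7 -> %d KiB\n" (max_transfer_bytes ~mdts:7 ~page_size / 1024);
  (* MDTS=9 -> 2048 KiB = 2 MiB *)
  Printf.printf "MDTS=9 -> %d KiB\n" (max_transfer_bytes ~mdts:9 ~page_size / 1024)
```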

Discussion is ongoing with Microsoft to get the issue identified and
possibly sorted on their end.  For the time being apply this mitigation in
qemu-wrapper as a workaround.

Signed-off-by: Roger Pau Monné <[email protected]>
Signed-off-by: Christian Lindig <[email protected]>
The current default value for the NVMe MDTS parameter exposed in QEMU
emulated NVMe devices is 7 (max 512KiB requests). However there seems to
be an internal Windows Server 2025 issue that possibly triggers when
splitting bigger requests into smaller ones in the NVMe Windows driver.

Increase the exposed MDTS value on the emulated QEMU NVMe device to 9
(max 2MiB request size), as that seems to drop the reproduction rate of
the issue.

Discussion is ongoing with Microsoft to get the issue identified and
possibly sorted on their end. For the time being apply this mitigation
in qemu-wrapper as a workaround.
Define and implement an operation that uses Xen's fast resume to resume a
domain. This operation is currently not used but has been tested. It is
accessible from the xenopsd CLI ("xenops-cli") for experiments.
During rolling pool upgrade (RPU), RestartVM guidance should only be
cleared when a VM restarts on a host that has been updated to match
the coordinator's software version. Previously, the guidance was
cleared whenever a VM restarted, regardless of the host's update
status.

This commit ensures that RestartVM guidance persists until the VM
restarts on an up-to-date host, which provides accurate feedback to
administrators about which VMs still need restarting after RPU.

Also adds unit tests covering 6 scenarios:
* VM start on updated vs old host (via xenopsd)
* Suspended VM resume on updated vs old host
* VM halt on updated vs old host (via force_state_reset)
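
A hedged sketch of the condition, with hypothetical helpers rather than xapi's actual code: the RestartVM guidance is only dropped when the restarting VM lands on a host whose software version matches the coordinator's.

```ocaml
(* [guidances] is the VM's list of pending guidances; versions are the host's
   and the coordinator's software versions (names here are illustrative) *)
let maybe_clear_restart_vm_guidance ~host_version ~coordinator_version guidances =
  if host_version = coordinator_version then
    List.filter (fun g -> g <> `RestartVM) guidances
  else
    guidances (* old host: keep RestartVM until a restart on an updated host *)
```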

Signed-off-by: Gang Ji <[email protected]>
…oject#6782)

During rolling pool upgrade (RPU), RestartVM guidance should only be
cleared when a VM restarts on a host that has been updated to match the
coordinator's software version. Previously, the guidance was cleared
whenever a VM restarted, regardless of the host's update status.

This commit ensures that RestartVM guidance persists until the VM
restarts on an up-to-date host, which provides accurate feedback to
administrators about which VMs still need restarting after RPU.

Also adds unit tests covering 6 scenarios:
* VM restart on updated vs old host (via xenopsd)
* VM halt on updated vs old host (via force_state_reset)
* Suspended VM resume on updated vs old host
It returns info on the allocated clusters as JSON.

Signed-off-by: Andrii Sultanov <[email protected]>
last-genius and others added 7 commits December 8, 2025 08:53
On export, instead of reading the whole raw disk, consult the JSON (if
provided), and only allocate the clusters that are present in the table.

This is analogous to vhd-tool's handling of export, and greatly speeds up
handling of sparse disks.

Signed-off-by: Andrii Sultanov <[email protected]>
Take the expected driver type as a parameter, to allow this helper to be used
by qcow code as well.

Signed-off-by: Andrii Sultanov <[email protected]>
Pass the JSON output of read_headers into qcow2-to-stdout to handle the export
further.

Signed-off-by: Andrii Sultanov <[email protected]>
…ters

Translates JSON from qcow-stream-tool to OCaml types.

This is currently unused, but will be used in stream_vdi and vhd_tool_wrapper
in the future.
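
A hedged illustration of what such a translation might look like using yojson; the field names ("cluster_size", "allocated_clusters") are assumptions, since the actual schema emitted by `read_headers` is not shown here:

```ocaml
(* requires the yojson library *)
let parse_allocated_clusters json_string =
  let open Yojson.Safe.Util in
  let json = Yojson.Safe.from_string json_string in
  let cluster_size = json |> member "cluster_size" |> to_int in
  let clusters =
    json |> member "allocated_clusters" |> to_list |> List.map to_int
  in
  (cluster_size, clusters)

(* byte offsets that a consumer would actually read, skipping zero clusters *)
let offsets_to_read json_string =
  let cluster_size, clusters = parse_allocated_clusters json_string in
  List.map (fun c -> c * cluster_size) clusters
```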

Signed-off-by: Andrii Sultanov <[email protected]>
…t#6769)

Implements an optimization similar to the "hybrid" mode in `vhd-tool`:
when exporting to qcow from raw, if the VDI is backed by a QCOW file,
read its header, determine the allocated clusters, and only export
these. This allows skipping over zero clusters in a sparse disk.

Unlike `vhd-tool`, however, this is implemented in a modular way -
`qcow-stream-tool` gets a new `read_headers` command that outputs the
list of allocated clusters (and other info) in JSON format, which allows
it to be consumed by the Python `qcow-to-stdout` script (and by
`vhd-tool` in future stages of this work, see below).

This is the first step of improving handling of sparse VDIs in xapi.
I've got the rest working, but I'll be opening PRs step-by-step for the
following once this PR gets merged:
1. `vhd-tool` gets a `read_headers` command outputting list of allocated
blocks as well
2. `stream_vdi` uses `read_headers` for both VHD and QCOW to avoid
reading zero blocks on XVA export (greatly speeds up handling of sparse
disks and avoids issues with timeouts)
3. `vhd-tool` and `qcow-to-stdout` can read headers of the opposite
format, allowing faster export of sparse VDIs backed by a different
format.

Best reviewed by commit.
Signed-off-by: Changlei Li <[email protected]>
@changlei-li
Contributor Author

Conflict resolved by keeping all the fields.

> git diff
diff --cc ocaml/tests/common/test_common.ml
index 9d5101999,7fc190f43..000000000
--- a/ocaml/tests/common/test_common.ml
+++ b/ocaml/tests/common/test_common.ml
@@@ -224,8 -224,7 +224,12 @@@ let make_host2 ~__context ?(ref = Ref.m
      ~pending_guidances_recommended:[] ~pending_guidances_full:[]
      ~last_update_hash:"" ~ssh_enabled:true ~ssh_enabled_timeout:0L
      ~ssh_expiry:Date.epoch ~console_idle_timeout:0L ~ssh_auto_mode:false
++<<<<<<< HEAD
 +    ~max_cstate:"" ~secure_boot:false ~ntp_mode:`Factory ~ntp_custom_servers:[]
 +    ~timezone:"UTC" ;
++=======
+     ~secure_boot:false ~https_only ;
++>>>>>>> master
    ref
  
  let make_pif ~__context ~network ~host ?(device = "eth0")

@changlei-li changlei-li merged commit e02d597 into xapi-project:feature/config-ntp-timezone-maxcstate Dec 10, 2025
16 checks passed
@changlei-li changlei-li deleted the private/changleli/sync-with-master branch December 10, 2025 09:57