Skip to content

Conversation

6-dehan
Copy link
Contributor

@6-dehan 6-dehan commented May 14, 2025

Add a new guest agent command 'guest-get-load' to
get cpu load average info of Guest.

ID: 3534
Signed-off-by: Dehan Meng [email protected]

Summary by CodeRabbit

  • New Features

    • Added guest load monitoring to the QEMU Guest Agent tests for Linux and Windows to measure and validate CPU load under stress.
  • Tests

    • New test compares guest-reported vs agent-reported 1/5/15-minute load values, verifies increases during CPU stress and decreases afterward.
    • Improved logging of pre-, during-, and post-stress load values for clearer diagnostics.
  • Chores

    • Updated configuration to enable the new load-check workflow.

@6-dehan 6-dehan force-pushed the KVMAUTOMA-3534-cpuload branch 7 times, most recently from 1a6487d to c9269f4 Compare May 15, 2025 08:02
@6-dehan
Copy link
Contributor Author

6-dehan commented May 15, 2025

for linux guest:
(1/1) Host_RHEL.m9.u6.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.9.7.0.x86_64.io-github-autotest-qemu.qemu_guest_agent.virtio_serial.check_get_load.q35: PASS (165.03 s)

@6-dehan 6-dehan force-pushed the KVMAUTOMA-3534-cpuload branch 3 times, most recently from 5ca1445 to fd0cfe1 Compare May 16, 2025 04:45
@6-dehan
Copy link
Contributor Author

6-dehan commented May 16, 2025

For Window VM:
(1/1) Host_RHEL.m9.u6.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.Win2025.x86_64.io-github-autotest-qemu.qemu_guest_agent.virtio_serial.check_get_load.q35: PASS (143.49 s)

@6-dehan
Copy link
Contributor Author

6-dehan commented May 16, 2025

@leidwang Could you please help to review this when you're free?

@6-dehan 6-dehan force-pushed the KVMAUTOMA-3534-cpuload branch from 7a7c8c3 to f8a677b Compare August 21, 2025 03:44
@6-dehan
Copy link
Contributor Author

6-dehan commented Sep 1, 2025

Linux could be passed normally, Windows still has to adjust a little.
(3/4) Host_RHEL.m10.u1.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.10.1.x86_64.io-github-autotest-qemu.qemu_guest_agent.virtio_serial.check_get_load.q35: PASS (136.85 s)

@6-dehan 6-dehan force-pushed the KVMAUTOMA-3534-cpuload branch from f8a677b to 6afbcbf Compare September 26, 2025 05:27
Copy link

coderabbitai bot commented Sep 26, 2025

Walkthrough

Adds a new QGA test gagent_check_get_load that measures guest and QGA load averages, runs platform-specific CPU stress, and verifies load increase/decrease; also adds a check_get_load block to qemu_guest_agent.cfg with commands to retrieve loads and run stress tools.

Changes

Cohort / File(s) Summary of edits
QGA test logic: get-load
qemu/tests/qemu_guest_agent.py
Adds gagent_check_get_load(self, test, params, env) and helper methods _get_load_stats, _verify_load_values, _log_load_values. Implements Linux and Windows flows to capture baseline loads, trigger CPU stress, compare guest vs QGA loads, and verify increase/decrease across phases.
QGA config: get-load commands
qemu/tests/cfg/qemu_guest_agent.cfg
Adds check_get_load block under gagent_debugview_VSS_DLL defining gagent_check_type = get_load, cmd_get_load, cmd_install_stressng, cmd_run_stress and a Windows PowerShell stress variant.
Docstring formatting changes
qemu/tests/qemu_guest_agent.py (multiple locations)
Minor docstring wording/formatting adjustments (e.g., “:param env: Dictionary with test environment.” → “:param env: Dictionary with the test environment”). No runtime behavior changes.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant TR as Test Runner
  participant T as QemuGuestAgentTest
  participant G as QGA
  participant S as Guest Session
  participant ST as Stress Tool

  TR->>T: gagent_check_get_load(test, params, env)
  T->>S: open session
  alt Linux
    T->>S: cmd_get_load (baseline)
    S-->>T: guest_load_baseline
    T->>G: gagent.get_load (baseline)
    G-->>T: qga_load_baseline
    T->>S: cmd_install_stressng
    T->>S: cmd_run_stress (start ST async)
    S->>ST: start workload (async)
    T->>S: cmd_get_load (under load)
    S-->>T: guest_load_stress
    T->>G: gagent.get_load (under load)
    G-->>T: qga_load_stress
    T->>T: verify increase
    ST-->>S: complete
    T->>S: cmd_get_load (post)
    S-->>T: guest_load_post
    T->>G: gagent.get_load (post)
    G-->>T: qga_load_post
    T->>T: verify decrease
  else Windows
    T->>G: gagent.get_load (baseline)
    G-->>T: qga_load_baseline
    T->>S: start PowerShell stress job (async)
    S->>ST: start workload (async)
    T->>G: gagent.get_load (under load)
    G-->>T: qga_load_stress
    T->>T: verify increase
    ST-->>S: complete
    T->>G: gagent.get_load (post)
    G-->>T: qga_load_post
  end
  T->>S: close session
  T-->>TR: test result
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I twitch my whiskers, tap the keys,
I coax the CPU up from its ease.
One, five, fifteen — the numbers play,
I watch them climb, then fade away.
Hoppity hops for tests today! 🥕🐇

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title clearly identifies the qemu_guest_agent component and succinctly describes the primary addition of support for the new ‘guest-get-load’ API command, which aligns directly with the pull request’s main objective. It is concise, specific, and free of extraneous detail, ensuring that reviewers can immediately understand the key change.
Docstring Coverage ✅ Passed Docstring coverage is 93.42% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 870c1ea and 6afbcbf.

📒 Files selected for processing (2)
  • qemu/tests/cfg/qemu_guest_agent.cfg (1 hunks)
  • qemu/tests/qemu_guest_agent.py (10 hunks)
🔇 Additional comments (1)
qemu/tests/cfg/qemu_guest_agent.cfg (1)

675-675: Confirm cleanup of Windows stress jobs

This PowerShell snippet spawns 12 background jobs that loop forever. If they aren't explicitly stopped later, they'll keep chewing CPU after the test finishes and bleed into subsequent cases. Please confirm the test logic tears them down (e.g. Get-Job | Stop-Job -Force; Get-Job | Remove-Job); otherwise, add a cleanup step.

@6-dehan 6-dehan force-pushed the KVMAUTOMA-3534-cpuload branch 2 times, most recently from 0c264cc to c6efd4e Compare September 26, 2025 05:51
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (3)
qemu/tests/qemu_guest_agent.py (3)

2109-2146: Make prev_values explicit in _verify_load_values.

The helper reads prev_values from closure, which is implicit and brittle. Pass it as a parameter for clarity and easier testing.

Apply this diff and update call sites:

-def _verify_load_values(qga_vals, guest_vals, check_type="match"):
+def _verify_load_values(qga_vals, guest_vals, check_type="match", prev_values=None):
@@
-            if check_type != "match" and prev_values:
+            if check_type != "match" and prev_values:

And later:

-            if errors := _verify_load_values(qga_vals, guest_vals):
+            if errors := _verify_load_values(qga_vals, guest_vals, prev_values=prev_values):
@@
-            if errors := _verify_load_values(qga_vals, guest_vals, "increase"):
+            if errors := _verify_load_values(qga_vals, guest_vals, "increase", prev_values=prev_values):
@@
-            if errors := _verify_load_values(qga_vals, guest_vals, "decrease"):
+            if errors := _verify_load_values(qga_vals, guest_vals, "decrease", prev_values=prev_values):

2208-2214: Prefer package helper or presence check before installing stress-ng.

Hard-failing on install can be flaky across distros/repos. Consider utils_package.package_install("stress-ng", session) or first checking if stress-ng is already present, then installing with a distro-agnostic helper.


2212-2227: Sleep-based timing is brittle; poll until condition or timeout.

Fixed sleeps (25s/60s) can be too short or long. Prefer polling for load crossing a threshold or for stress-ng to exit (e.g., utils_misc.wait_for with a predicate).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0c264cc and c6efd4e.

📒 Files selected for processing (2)
  • qemu/tests/cfg/qemu_guest_agent.cfg (1 hunks)
  • qemu/tests/qemu_guest_agent.py (9 hunks)
🔇 Additional comments (1)
qemu/tests/cfg/qemu_guest_agent.cfg (1)

672-672: Fix stress-ng installation for non-DNF guests (fallback to yum).

RHEL 7 doesn't have dnf; current command will fail there. Use a simple dnf→yum fallback.

Apply this diff:

-            cmd_install_stressng = "dnf -y install stress-ng"
+            cmd_install_stressng = "if command -v dnf; then dnf -y install stress-ng; else yum -y install stress-ng; fi"

@6-dehan
Copy link
Contributor Author

6-dehan commented Sep 26, 2025

Hi @leidwang, Linux always passes. Currently, Windows has a product issue that I'm discussing with Dev. so the automation code is fine I think.

(1/2) Host_RHEL.m10.u1.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.10.1.x86_64.io-github-autotest-qemu.qemu_guest_agent.virtio_serial.check_get_load.q35: PASS (151.52 s)

Copy link

@arka-majumd arka-majumd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typos

1. Add a new guest agent command 'guest-get-load' to
get cpu load average info of Guest.
2. Randomly sorted out non-standard comments

Signed-off-by: Dehan Meng <[email protected]>
@6-dehan 6-dehan force-pushed the KVMAUTOMA-3534-cpuload branch from c6efd4e to a5d7ca3 Compare September 30, 2025 15:51
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
qemu/tests/qemu_guest_agent.py (1)

2160-2195: Windows stress jobs persist; add cleanup and verify post-stress decrease

Start-Job workers are never stopped and the test doesn’t check that load falls after stress. Add explicit cleanup and a decrease check; also run cleanup in finally. This avoids leaving the VM pegged and flaking follow-on tests.

Apply:

@@
-        if params.get("os_type") == "windows":
+        if params.get("os_type") == "windows":
             error_context.context("Get load info for Windows", LOG_JOB.info)
-            try:
+            # Stop any leftover stress jobs by command pattern used in cmd_run_stress
+            cleanup_cmd = (
+                'powershell "$jobs = Get-Job | Where-Object { $_.Command -like '
+                "''*sqrt(999999)*'' }; if ($jobs) { $jobs | Stop-Job -Force; "
+                '$jobs | Remove-Job }"'
+            )
+            try:
                 # Get initial load values
                 load_info = self.gagent.get_load()
@@
                 LOG_JOB.info(
                     "Load value increased as expected: before=%s, after=%s",
                     initial_load,
                     stress_load,
                 )
+                # Stop stress jobs and verify load decreases
+                session.cmd(cleanup_cmd)
+                time.sleep(20)
+                after_info = self.gagent.get_load()
+                after_load = after_info["load1m"]
+                LOG_JOB.info("Load info after stopping stress: %s", after_info)
+                if after_load >= stress_load:
+                    test.fail(
+                        f"Load value did not decrease after stopping stress:"
+                        f" stress={stress_load}, after={after_load}"
+                    )
             except guest_agent.VAgentCmdError as e:
                 test.fail(f"guest-get-load command failed: {e}")
+            finally:
+                # Best-effort cleanup even on failure
+                session.cmd(cleanup_cmd)

To verify locally, rerun the Windows variant and confirm jobs are gone after the test and load1m drops below the stressed value.

🧹 Nitpick comments (3)
qemu/tests/qemu_guest_agent.py (3)

2109-2146: Make prev_values an explicit parameter of _verify_load_values

The helper implicitly closes over prev_values. Pass it explicitly to avoid hidden state and misuse.

-def _verify_load_values(qga_vals, guest_vals, check_type="match"):
+def _verify_load_values(qga_vals, guest_vals, check_type="match", prev_values=None):
@@
-    if check_type != "match" and prev_values:
+    if check_type != "match":
+        if not prev_values:
+            return ["prev_values is required when check_type != 'match'"]
         qga_1m = qga_vals[0]
         guest_1m = guest_vals[0]
         prev_qga_1m = prev_values["qga"][0]
         prev_guest_1m = prev_values["guest"][0]

And pass it at call sites:

- if errors := _verify_load_values(qga_vals, guest_vals, "increase"):
+ if errors := _verify_load_values(qga_vals, guest_vals, "increase", prev_values):
@@
- if errors := _verify_load_values(qga_vals, guest_vals, "decrease"):
+ if errors := _verify_load_values(qga_vals, guest_vals, "decrease", prev_values):

Also applies to: 2220-2221, 2232-2233


2095-2101: Harden guest load parsing to tolerate formats like ‘uptime’

If cmd_get_load returns “load average: 0.10, 0.05, 0.01”, split() will leave commas and raise ValueError. Use a regex to extract the first three numbers.

-                try:
-                    loads = session.cmd_output(params["cmd_get_load"]).strip().split()
-                    return tuple(round(float(x), 2) for x in loads[:3])
+                try:
+                    out = session.cmd_output(params["cmd_get_load"]).strip()
+                    # capture floats/ints, ignore commas/labels
+                    nums = re.findall(r"[-+]?(?:\d*\.\d+|\d+)", out)
+                    return tuple(round(float(x), 2) for x in nums[:3])

2172-2176: Reduce fixed sleeps; prefer polling with a deadline

Replace sleeps with utils_misc.wait_for to reduce flakiness and overall wall time. For example, wait until load1m exceeds baseline by a delta, then proceed; later wait until it drops below a threshold.

Example pattern:

# Wait up to 30s for increase of >= 0.5
utils_misc.wait_for(lambda: self.gagent.get_load()["load1m"] >= prev_values["qga"][0] + 0.5, 30, 2)

# After stopping stress, wait up to 60s for decrease
utils_misc.wait_for(lambda: self.gagent.get_load()["load1m"] <= stress_load - 0.3, 60, 3)

Please confirm acceptable deltas for your environments.

Also applies to: 2208-2214, 2225-2230

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c6efd4e and a5d7ca3.

📒 Files selected for processing (2)
  • qemu/tests/cfg/qemu_guest_agent.cfg (1 hunks)
  • qemu/tests/qemu_guest_agent.py (56 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • qemu/tests/cfg/qemu_guest_agent.cfg
🔇 Additional comments (1)
qemu/tests/qemu_guest_agent.py (1)

457-457: Docstring/env param wording updates

Spelling/consistency fixes to “:param env: Dictionary with the test environment” across methods look good. No further action.

Also applies to: 484-484, 512-512, 528-528, 665-665, 685-685, 729-729, 779-779, 821-821, 960-960, 997-997, 1085-1085, 1174-1174, 1369-1369, 1655-1655, 1732-1732, 1905-1905, 1971-1971, 2069-2071, 2242-2242, 2381-2381, 2493-2493, 2625-2625, 2759-2759, 2920-2920, 3057-3057, 3089-3089, 3090-3090, 3182-3182, 3242-3242, 3261-3261, 3368-3369, 3555-3556, 3850-3850, 3880-3880, 3921-3921, 4049-4053, 4071-4071, 4121-4121, 4182-4182, 4561-4561, 4688-4689, 4790-4790, 4956-4957, 5030-5031, 5142-5143, 5184-5185, 5257-5258

Comment on lines +2208 to +2234
error_context.context("Starting CPU stress test", LOG_JOB.info)
s, o = session.cmd_status_output(params["cmd_install_stressng"])
if s != 0:
test.error(f"Failed to install stress-ng: {o}")
session.cmd(params["cmd_run_stress"])
time.sleep(25)

guest_vals = _get_load_stats(session)
qga_vals = _get_load_stats(session, False)

_log_load_values(guest_vals, qga_vals, "Under stress")

if errors := _verify_load_values(qga_vals, guest_vals, "increase"):
test.fail("Stress test load check failed:\n" + "\n".join(errors))

prev_values = {"guest": guest_vals, "qga": qga_vals}

# sleep (60) wait for the stress-ng terminated.
time.sleep(60)
guest_vals = _get_load_stats(session)
qga_vals = _get_load_stats(session, False)

_log_load_values(guest_vals, qga_vals, "After stress")

if errors := _verify_load_values(qga_vals, guest_vals, "decrease"):
test.fail("Post-stress load check failed:\n" + "\n".join(errors))

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Best-effort Linux stress cleanup to avoid polluting subsequent tests

If stress-ng doesn’t terminate cleanly, it may bias later readings. Add a fallback stop command after the 60s wait (configurable via params).

-            # sleep (60) wait for the stress-ng terminated.
-            time.sleep(60)
+            # Wait for stress to end, then ensure it's stopped.
+            time.sleep(60)
+            stop_cmd = params.get("cmd_stop_stress", "pkill -f stress-ng || true")
+            session.cmd_status_output(stop_cmd)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
error_context.context("Starting CPU stress test", LOG_JOB.info)
s, o = session.cmd_status_output(params["cmd_install_stressng"])
if s != 0:
test.error(f"Failed to install stress-ng: {o}")
session.cmd(params["cmd_run_stress"])
time.sleep(25)
guest_vals = _get_load_stats(session)
qga_vals = _get_load_stats(session, False)
_log_load_values(guest_vals, qga_vals, "Under stress")
if errors := _verify_load_values(qga_vals, guest_vals, "increase"):
test.fail("Stress test load check failed:\n" + "\n".join(errors))
prev_values = {"guest": guest_vals, "qga": qga_vals}
# sleep (60) wait for the stress-ng terminated.
time.sleep(60)
guest_vals = _get_load_stats(session)
qga_vals = _get_load_stats(session, False)
_log_load_values(guest_vals, qga_vals, "After stress")
if errors := _verify_load_values(qga_vals, guest_vals, "decrease"):
test.fail("Post-stress load check failed:\n" + "\n".join(errors))
prev_values = {"guest": guest_vals, "qga": qga_vals}
# Wait for stress to end, then ensure it's stopped.
time.sleep(60)
stop_cmd = params.get("cmd_stop_stress", "pkill -f stress-ng || true")
session.cmd_status_output(stop_cmd)
guest_vals = _get_load_stats(session)
qga_vals = _get_load_stats(session, False)
_log_load_values(guest_vals, qga_vals, "After stress")
if errors := _verify_load_values(qga_vals, guest_vals, "decrease"):
test.fail("Post-stress load check failed:\n" + "\n".join(errors))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants