env_process: Refactor huge pages setup/cleanup steps #4054

Open · wants to merge 1 commit into base: master
Conversation

@bgartzi (Contributor) commented Jan 17, 2025

Creating a new Setuper subclass for setting up and cleaning up huge pages, removing the original code from virttest.env_process, and replacing it with the new HugePagesSetup class registered in the setup_manager.

_pre_hugepages_surp and _post_hugepages_surp were left in env_process. Their goal is to provide a mechanism in env_process to raise a TestFail in case pages were leaked during a test. If that mechanism were refactored into the setuper, the TestFail would be masked by a plain Error due to the way setup_manager handles postprocess exceptions. Changing how SetupManager handles that requires a bigger discussion on how the test infrastructure should handle test status reports, which is a far broader topic than what this patch aims to cover.

This is a patch from a larger patch series refactoring the env_process preprocess and postprocess functions. In each of these patches, a pre/post process step is identified and replaced with a Setuper subclass so that the following goals can finally be met:
- Run only the cleanup steps of successful setup steps, avoiding possible environment corruption and hard-to-read errors.
- Run setup/cleanup steps symmetrically during env pre/post process.
- Reduce the explicit pre/post process function code length.
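As context for reviewers, the Setuper/SetupManager pattern this series targets can be sketched roughly as follows. This is a minimal, self-contained approximation, not the real virttest API: the names (Setuper, SetupManager, register, do_setup, do_cleanup, HugePagesSetup) follow the discussion, but the bodies are illustrative stand-ins.

```python
from abc import ABC, abstractmethod

class Setuper(ABC):
    """Minimal stand-in for a virttest-style setuper base class."""
    def __init__(self, test, params, env):
        self.test, self.params, self.env = test, params, env

    @abstractmethod
    def setup(self): ...

    @abstractmethod
    def cleanup(self): ...

class SetupManager:
    """Runs registered setupers; cleans up only what was successfully set up."""
    def __init__(self):
        self._classes = []
        self._done = []

    def register(self, cls):
        self._classes.append(cls)

    def do_setup(self, test, params, env):
        for cls in self._classes:
            setuper = cls(test, params, env)
            setuper.setup()
            # Recorded only *after* setup() succeeds, so cleanup stays symmetric.
            self._done.append(setuper)

    def do_cleanup(self):
        errors = []
        while self._done:
            try:
                self._done.pop().cleanup()  # reverse order of setup
            except Exception as err:
                errors.append(str(err))  # collect, don't abort remaining cleanups
        return errors

class HugePagesSetup(Setuper):
    """Toy example of a setuper gated by a params flag."""
    def setup(self):
        if self.params.get("setup_hugepages") == "yes":
            self.env["hp"] = "configured"

    def cleanup(self):
        self.env.pop("hp", None)

mgr = SetupManager()
mgr.register(HugePagesSetup)
env = {}
mgr.do_setup(test=None, params={"setup_hugepages": "yes"}, env=env)
assert env == {"hp": "configured"}
assert mgr.do_cleanup() == []
assert env == {}
```

The key property the series relies on is that a setuper is recorded only after its setup() succeeds, so do_cleanup() tears down exactly what was set up, in reverse order.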

Signed-off-by: Beñat Gartzia Arruabarrena <[email protected]>
@YongxueHong (Contributor):
Hi @PaulYuuu
Could you help review it? Thanks.

```python
self.params["setup_hugepages"] = "yes"
if self.params.get("setup_hugepages") == "yes":
    h = test_setup.HugePageConfig(self.params)
    env_process._pre_hugepages_surp = h.ext_hugepages_surp
```
Contributor:
_pre_hugepages_surp and _post_hugepages_surp are for the hugepage leak check. With this Setuper, I would suggest dropping them by returning a variable after do_cleanup, so that leak_num = _post_hugepages_surp - _pre_hugepages_surp can be shortened to leak_num = <new_var>.

Contributor (Author):

If I understood correctly, you are proposing to make the cleanup method of this Setuper return the leak_num value?
That would also involve updating the SetupManager behavior to meet the demands of HugePagesSetup. If that's the case, we would permit every Setuper to return a value, which goes against the current implementation. We would then have to update the SetupManager do_cleanup logic to handle that.

In my opinion, if we were to do that, we would have to think of a protocol of some sort to make this implementable by each Setuper, instead of adding Setuper-specific logic to the rather general SetupManager. Could something like adding a post_cleanup_check function to the core Setuper, called from SetupManager.do_cleanup after the cleanup method, be the answer to that issue?
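A rough sketch of what such a post_cleanup_check protocol could look like, using plain Python stand-ins rather than the real virttest classes (the hook name and the leak arithmetic are hypothetical, taken from this thread):

```python
class Setuper:
    """Stand-in base class; real setupers would also define setup()."""
    def cleanup(self):
        pass

    def post_cleanup_check(self):
        """Optional hook; a no-op by default so existing setupers need no change."""
        pass

class HugePagesSetup(Setuper):
    """Illustrative setuper whose check compares surplus page counts."""
    def __init__(self, pre_surp, post_surp_getter):
        self._pre = pre_surp
        self._get_post = post_surp_getter

    def post_cleanup_check(self):
        leaked = self._get_post() - self._pre
        if leaked > 0:
            raise AssertionError("%d huge pages leaked!" % leaked)

def do_cleanup(setupers):
    """Run cleanups in reverse order, then the per-setuper checks."""
    errors, check_failures = [], []
    for s in reversed(setupers):
        try:
            s.cleanup()
        except Exception as e:
            errors.append(str(e))
            continue  # a failed cleanup skips its check
        try:
            s.post_cleanup_check()
        except Exception as e:
            check_failures.append(str(e))
    return errors, check_failures
```

Keeping check failures separate from cleanup errors would let env_process decide how to report each, which is exactly the status-reporting question debated below.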

I also thought about other approaches, such as implementing a core Singleton abstraction, so that we would be able to reach Setuper instances instead of classes from within env_process and call extra functions on demand after cleanup had terminated. However, that approach sounds too complex for a workaround, and it could introduce further issues, such as Setuper instances "surviving" from one test case run to the next.

Contributor:

> If I understood correctly, you are proposing to make the cleanup method of this Setuper return the leak_num value?

No, the workaround is that the Setuper will calculate _post_hugepages_surp - _pre_hugepages_surp and set env_process._hugepage_leaks (take this name as an example).
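A minimal illustration of this workaround, with a SimpleNamespace standing in for the virttest.env_process module and _hugepage_leaks being the example attribute name suggested above (the surp_getter parameter is an illustrative stand-in for utils_memory.get_num_huge_pages_surp):

```python
import types

# Stand-in for the virttest.env_process module.
env_process = types.SimpleNamespace(_hugepage_leaks=0)

class HugePagesSetup:
    """Illustrative setuper that stashes the leak count on env_process."""
    def __init__(self, surp_getter):
        self._get_surp = surp_getter
        self._pre = None

    def setup(self):
        self._pre = self._get_surp()
        # ... allocate/mount huge pages here ...

    def cleanup(self):
        # ... unmount/deallocate huge pages here ...
        env_process._hugepage_leaks = self._get_surp() - self._pre

readings = iter([10, 13])  # surplus before setup, surplus after cleanup
s = HugePagesSetup(lambda: next(readings))
s.setup()
s.cleanup()
assert env_process._hugepage_leaks == 3
# env_process itself can then raise TestFail when _hugepage_leaks > 0,
# keeping the test-status decision out of SetupManager.
```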

The complex solution you mentioned is that a Setuper can return something. I agree we can do this, but not now, since the implementation would closely couple env_process and Setuper.

Contributor:

Hi @bgartzi @PaulYuuu
First, I am confused about the huge page leak check: it will result in the leak error during the post process whenever the user disables deallocation by setting hugepages_deallocate = no, because the huge page memory is only deallocated when self.deallocate is True.

```python
def cleanup(self):
    if self.deallocate:
        error_context.context("trying to deallocate hugepage memory")
```

So I think there is a potential logical issue with the leak check when self.deallocate is False. To sum it up, the leak check needs to work together with huge page deallocation, otherwise it will raise the "huge pages leaked!" error all the time.
Hi @PaulYuuu, please correct me if I misunderstood the implementation of the huge page configuration and the related post_process. Thanks.

Second, is it possible to integrate the leak check into cleanup() instead of calling it independently in the post process? From my understanding of the post process code, each test case that needs huge pages checks whether they leaked. Hi @PaulYuuu, could you share the reason for calling the leak check here instead of in cleanup()?

My reason is that we could use a unified way to raise all errors with the following:

```python
if err:
    raise RuntimeError("Failures occurred while postprocess:\n%s" % err)
```

err contains all the runtime errors raised during the postprocess, and I think the huge pages check should be covered by err as well:

```python
elif _post_hugepages_surp > _pre_hugepages_surp:
    leak_num = _post_hugepages_surp - _pre_hugepages_surp
    raise exceptions.TestFail("%d huge pages leaked!" % leak_num)
```

As a result, we could let the user know the error occurred in the post process, rather than reporting it as an exceptions.TestFail.
BTW, we hope that env_process-related errors are raised during the pre-process and post-process, while the test case itself raises exceptions.TestFail.

So I suggest we refactor the previous implementation by integrating the leak check into cleanup. And @bgartzi, you could then refactor HugePagesSetup in the normal way.
Please let me know your opinions. @PaulYuuu @bgartzi
Thanks.

Contributor:

No, IIRC, if hugepages_deallocate is not set, we will never hit the error: HugePages_Surp only changes when pages leak or when using THP, which is unrelated to setup/teardown. And yes, we should move self.ext_hugepages_surp = utils_memory.get_num_huge_pages_surp() out of the if self.deallocate branch in cleanup.

Without the Setuper, we cannot handle the huge page leak check in cleanup in the current pre/post process context: pre and post each init a separate HugePageConfig instance, so we cannot directly compare ext_hugepages_surp across the two HugePageConfig objects.

For now, as we have each Setuper, the leak check is possible, and we can do it, either in this PR or in an individual one.
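A sketch of what that could look like, measuring the surplus outside the deallocate branch as suggested; the class shape and the surp_getter parameter are illustrative stand-ins, not the real HugePageConfig/Setuper code:

```python
class HugePagesSetup:
    """Illustrative setuper: one instance spans setup and cleanup,
    so the pre-setup surplus survives until the check."""
    def __init__(self, params, surp_getter):
        self.deallocate = params.get("hugepages_deallocate", "yes") == "yes"
        self._get_surp = surp_getter  # stand-in for reading HugePages_Surp
        self._pre_surp = None

    def setup(self):
        self._pre_surp = self._get_surp()
        # ... allocate and mount huge pages ...

    def cleanup(self):
        if self.deallocate:
            pass  # ... unmount and deallocate huge pages ...
        # Measured unconditionally, outside the `if self.deallocate` branch,
        # on the premise that HugePages_Surp only moves on a real leak.
        leaked = self._get_surp() - self._pre_surp
        if leaked > 0:
            raise RuntimeError("%d huge pages leaked!" % leaked)
```

Because a single instance performs both setup and cleanup, the cross-HugePageConfig problem described above disappears; the remaining question is which exception type the check should raise.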

Contributor (Author):

Following each discussion thread:

Deallocation + leak check

What should we do if deallocation doesn't happen? Should we check for leaks?

I see there isn't any test setting hugepages_deallocate. I see that it is set to hugepages_deallocate = yes by default here:

```
# Similarly, on low mem systems if you deallocate hugepages and try
# to allocate them again it won't be possible to find contiguous pages,
# so make it possible to turn it off by setting this to 'no'.
hugepages_deallocate = yes
```

Based on 5b4db50, I see the deallocate option was added due to some small-memory PPC systems. So, it could be set to no for that reason.

I'm not a memory or hugepages expert, so I'm not sure I understand this properly. But based on my understanding of allocation/deallocation, here are my questions:

  • If we allocate or mount huge pages, does the utils_memory.get_num_huge_pages_surp output value change? In other words, does the surplus huge pages value change when we mount/unmount hugepages?
  • If that happens, I would expect the same to happen when we deallocate/unmount hugepages. Right?
  • If both of the previous are true: we measure ext_hugepages_surp before setup. So if we measured it after setup but before cleanup, the obtained value would be different. If we ran the leak check (AKA what triggers the TestFail) between setup and cleanup, I'd expect the test to fail. Is that assumption correct?
  • Even if the previous one is true: if we don't deallocate, we wouldn't run cleanup at all, so I'd expect the utils_memory.get_num_huge_pages_surp value to be the same as right after setup. That is, different from pre-setup, where we measured ext_hugepages_surp for the last time. If that's correct, wouldn't moving self.ext_hugepages_surp = utils_memory.get_num_huge_pages_surp() out of the if self.deallocate clause cause all tests that don't deallocate (few, as far as I know) to fail?

Again, this reasoning is based on my limited understanding of the topic, so please correct me.

Leak check failure trigger

Where should we trigger the failure/error if we detect a leak?

@YongxueHong, as you suggest, I would refactor the piece of code running the check into HugePagesSetup.cleanup. However, that brings me some concerns about the raised error type:

  • The check raises a TestFail. That sets the test status to "Failure".
  • If we write the check into HugePagesSetup.cleanup, the SetupManager will gather the raised exception (together with any other sort of exception) and will raise a RuntimeError. This would set the test status to "Error".
  • If we change this, we would change the behavior of all tests related to huge pages and how those register their result status.
  • That's my main concern: in summary, a test error not being the same as a test failure.

I believe we should limit ourselves to just postprocessing the test environment here, not declaring the result of the test during cleanup. However, I also understand that if we can only check ext_hugepages_surp after the huge pages unmount, test cases have no way of checking that by themselves either. If the spec of the huge pages tests declares that they should fail when there is a leak, then I guess we can't decide whether an error will suffice. So, long story short: I'm not sure about masking the TestFail raised by the leak check under a RuntimeError.

I'd like to hear your thoughts about this.

Contributor:

> (quoting @bgartzi's "Deallocation + leak check" questions above in full)

> Leak check failure trigger
>
> Where should we trigger the failure/error if we detect a leak?
>
> @YongxueHong, as you suggest, I would refactor the piece of code running the check into HugePagesSetup.cleanup. However, that brings me some concerns about the raised error type:
>
> • The check raises a TestFail. That sets the test status to "Failure".
> • If we write the check into HugePagesSetup.cleanup, the SetupManager will gather the raised exception (together with any other sort of exception) and will raise a RuntimeError. This would set the test status to "Error".

Hi @bgartzi
From my understanding of env_process, the purpose of raising the RuntimeError at the end of the post process is to collect all the errors without stopping the next cleanup when one cleanup fails, so we cannot raise each error type and at the same time make sure all the cleanups are executed.
If we want to add another checkpoint that raises a TestFail later, how do we guarantee that all checkpoints will be executed? So I think we can let the RuntimeError (a unified error) cover the errors from each cleanup.

> • If we change this, we would change the behavior of all tests related to huge pages and how those register their result status.
> • That's my main concern: in summary, a test error not being the same as a test failure.
>
> I believe we should limit ourselves to just postprocessing the test environment here, not declaring the result of the test during cleanup. However, I also understand that if we only can check ext_hugepages_surp after huge pages umount, test cases don't have a way of checking that by themselves either. If the spec of the huge pages tests declares that they should fail if there is a leak, then I guess we can't decide whether an error will suffice or not. So long story short: I'm not sure about masking the TestFail the leak check raises under a RuntimeError.
>
> I'd like to hear your thoughts about this.

I think the key point is to understand the main responsibility of the post-process, even the pre-process.

I think the preprocess is mainly responsible for preparing the environment for the test case, for example setting it up, installing packages, and so on. The post process, in turn, cleans up and uninstalls packages to keep the environment pure for the next test case. (We cannot maintain it, or guide others in developing it, if we do not define both the preprocess and the postprocess.)

And if some errors occur during the post process, we cannot treat them as TestFail or TestError, since those errors do not affect the result of the test case from the test case's perspective; it just performed some final teardown steps.

Therefore, if this checkpoint is a step of the test case, I prefer to let the test case handle it rather than the post-process; otherwise, it should follow the rule of the post-process, which is to raise all the errors with a unified error type (whatever the type of each error).
To sum it up: let the test case raise TestError, TestFail, and so on, instead of the pre-process and post-process.

Please let me know your thoughts.
Thanks.

Contributor (Author):

> (quoting the earlier "Deallocation + leak check" and "Leak check failure trigger" discussion above in full)

> Hi @bgartzi From my understanding of env_process, I think the purpose of raising the RuntimeError at the end of the post process is to collect all the errors and not stop performing the next cleanup if one cleanup failed, so we can not raise each error type and in the meanwhile make sure all the cleanup can be executed. In case we want to add another checkpoint to raise a TestFail later, how do we guarantee that all checkpoints will be executed? So I think we can let the RuntimeError (a unified error) cover the errors from each cleanup.

Yes, that's how cleanup works. There's one way that comes to my mind, see:

```diff
diff --git a/virttest/test_setup/core.py b/virttest/test_setup/core.py
--- a/virttest/test_setup/core.py
+++ b/virttest/test_setup/core.py
@@ -2,6 +2,7 @@ import logging
 from abc import ABCMeta, abstractmethod
 
 import six
+from avocado.core import exceptions
 
 LOG = logging.getLogger("avocado." + __name__)
 
@@ -94,11 +95,32 @@ class SetupManager(object):
 
         :return: Errors occurred in cleanup procedures.
         """
+        cancels = []
+        fails = []
         errors = []
+        warns = []
         while self.__setupers:
             try:
                 self.__setupers.pop().cleanup()
+            except exceptions.TestCancel as err:
+                LOG.error(str(err))
+                cancels.append(str(err))
+            except exceptions.TestFail as err:
+                LOG.error(str(err))
+                fails.append(str(err))
+            except exceptions.TestWarn as err:
+                LOG.warning(str(err))
+                warns.append(str(err))
             except Exception as err:
                 LOG.error(str(err))
                 errors.append(str(err))
+        if cancels:
+            msg = "\n".join(cancels)
+            raise exceptions.TestCancel(msg)
+        elif fails:
+            msg = "\n".join(fails)
+            raise exceptions.TestFail(msg)
+        elif not errors and warns:
+            msg = "\n".join(warns)
+            raise exceptions.TestWarn(msg)
         return errors
```

However, that implies an error/fail/warning hierarchy of sorts, and that we support postprocess steps raising TestFail as well as other sorts of errors. We already agreed that is not behavior we want to support, and if we did, I think it would be worth a whole separate patch.

As you mention, yes, I also agree we can just merge all errors into the RuntimeError that the Setuper's cleanup raises.

The problem is that if we mask the leak detection under the RuntimeError, the behavior users (hugepages tests) expect from avocado-vt will change. If you, as avocado-vt maintainers, agree on changing that behavior, I'm also okay with it; based on the comments below, I think that's the case.

> • If we change this, we would change the behavior of all tests related to huge pages and how those register their result status.
> • That's my main concern: in summary, a test error not being the same as a test failure.
>
> I believe we should limit ourselves to just postprocessing the test environment here, not declaring the result of the test during cleanup. However, I also understand that if we only can check ext_hugepages_surp after huge pages umount, test cases don't have a way of checking that by themselves either. If the spec of the huge pages tests declares that they should fail if there is a leak, then I guess we can't decide whether an error will suffice or not. So long story short: I'm not sure about masking the TestFail the leak check raises under a RuntimeError.
> I'd like to hear your thoughts about this.

> I think the key point is to understand the main responsibility of the post-process, even the pre-process.
>
> I think the preprocess is mainly responsible for preparing the environment for the test case, for example, setting it up, installing the package, and so on. Instead, the post process works for cleaning up and uninstalling the package to keep the environment pure for the next test case. (We can not maintain and guide others in developing it if we do not define both the preprocess and postprocess.)
>
> And if some errors occurred during the post process, we can not think it belongs to the TestFail or TestError, since those errors do not affect the result of the test case from the test case perspective, it just did some teardown steps finally.
>
> Therefore, if this checkpoint is one step for the test case, I prefer to let the test case handle it, rather than in the post-process; otherwise, it should follow the rule of the post-process, which is to raise all the errors with a unified error type (whatever the type of the error). To sum it up: let the test case raise the TestError, TestFail, and so on, instead of the pre-process and post-process.

Yes, I also agree that would be the ideal situation. However, based on my understanding, hugepage tests need the cleanup to run: they need to wait until huge pages are released before checking whether a leak happened. That's why the TestFail in this case was written in postprocess instead of in the test case itself, I think.

> Please let me know your thoughts. Thanks.

Looking forward to knowing yours. I think we're converging on an agreement. Thanks a lot!
