Fix CI fail for extended test (by freeing up more disk space in CI runner) #14745

2010YOUY01 · 2025-02-18T09:52:15Z

Which issue does this PR close?

Closes extended_test (with memory limit tracking) are commented out #14680

Rationale for this change

CI for extended tests are failing because there is not enough disk space for CI runners, test binaries take lots of disk space.

This PR frees up more disk space on CI runner (12GB -> 45GB).
Before, after setting up Rust dependencies, there is only 12GB disk space left to run the tests, I think this can run into trouble if we want extended tests to do more stuff in the future.

What changes are included in this PR?

CI change

Directly running test on Github CI runner, instead of Docker container, and manually setup Rust dependencies.
Applied Github Action https://github.com/jlumbroso/free-disk-space to delete unnecessary files on CI runner.

Test change
Memory validation test causes recompilations (I didn't find a direct fix 🤦🏼 ), and it caused large disk space usage. Before it can cause multiple recompilations (if we add more similar test cases, disk usage can continue grow).
This PR did uses a singleton (init_once) to make sure recompilation happens exactly once, and the disk usage won't grow indefinitely.

Are these changes tested?

I tested on my local clone, and it works
https://github.com/2010YOUY01/arrow-datafusion/actions/runs/13387132954/job/37386562838

Are there any user-facing changes?

No

Omega359 · 2025-02-18T18:45:19Z

.github/workflows/extended.yml

+          submodules: true
+          fetch-depth: 1
+      - name: Free Disk Space (Ubuntu)
+        uses: jlumbroso/[email protected]


This may possibly be a concern - the apache org limits what 3rd party actions are allowed though I am unsure if there is a list of them. I know I've encountered actions that were not allowed in the past. If it is a blocker you may have to request an exemption from the apache devops people. I think they also may want to use a git hash vs a version # to be sure it doesn't change underneath us.

Thank you. I changed the version number to hash.
Regarding whether this action is in the approved list, this action is able to run in my cloned branch, so I guess it's allowed.

datafusion/core/tests/memory_limit/memory_limit_validation/sort_mem_validation.rs

ozankabak · 2025-02-19T06:19:13Z

.github/workflows/extended.yml

@@ -95,7 +106,7 @@ jobs:
      - name: Run tests
        run: |
          cd datafusion
-          cargo test  --profile ci --exclude datafusion-examples --exclude datafusion-benchmarks --exclude datafusion-sqllogictest --workspace --lib --tests --features=force_hash_collisions,avro,extended_tests
+          cargo test  --profile ci --exclude datafusion-examples --exclude datafusion-benchmarks --exclude datafusion-sqllogictest --workspace --lib --tests --features=force_hash_collisions,avro


Why will we not run extended tests with hash collisions? We may lose some coverage here. If the only problematic thing is the memory-limited tests, we should place it under a separate flag and only skip those tests.

What I wrote above seems wrong - I did a search and the extended_tests seems to be only gating these memory tests. If that is the case, maybe the name of the flag is not that great -- it led me to think we are disabling a bunch of other tests in addition to these ones.

I agree the fact that the flag extended_tests and the workflow is named extended is quite confusing

Maybe as a follow on PR we can rename the extended_test flag somthing different like extra_tests or extended_suite 🤔

…t_mem_validation.rs Co-authored-by: Bruce Ritchie <[email protected]>

alamb

Thank you @2010YOUY01 -- the idea of cleaning packages from the runners makes a lot of sense to me as does running the suite on its own though I am worried about the use of a third party github action

I don't understand the init_once function

alamb · 2025-02-20T11:48:28Z

.github/workflows/extended.yml

@@ -95,7 +106,7 @@ jobs:
      - name: Run tests
        run: |
          cd datafusion
-          cargo test  --profile ci --exclude datafusion-examples --exclude datafusion-benchmarks --exclude datafusion-sqllogictest --workspace --lib --tests --features=force_hash_collisions,avro,extended_tests
+          cargo test  --profile ci --exclude datafusion-examples --exclude datafusion-benchmarks --exclude datafusion-sqllogictest --workspace --lib --tests --features=force_hash_collisions,avro


I agree the fact that the flag extended_tests and the workflow is named extended is quite confusing

Maybe as a follow on PR we can rename the extended_test flag somthing different like extra_tests or extended_suite 🤔

alamb · 2025-02-20T11:52:09Z

datafusion/core/tests/memory_limit/memory_limit_validation/sort_mem_validation.rs

+///
+/// TODO: This is a hack, can be cleaned up if we have a better way to let multiple
+/// test cases run in different processes (instead of different threads by default)
+fn init_once() {


I don't undersrtand how this avoids recompilation

It seems like recompilation would happen if the options / features were different

I believe there are some subtle differences between the outer/inner test command to cause the recompilation, I tried to match the option and features but still can't work.
This init_once is to solve the issue that different inner test commands can trigger multiple recompilation, I think there is no existing mechanism to coordinate parallel compilation, so this is just a lock to coordinate. Once the first inner test is compiled, other inner tests can use the same test binary without recompilation.

alamb · 2025-02-20T11:55:29Z

.github/workflows/extended.yml

+          submodules: true
+          fetch-depth: 1
+      - name: Free Disk Space (Ubuntu)
+        uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be


I am a little worried about using a third-party action

It seems from the soure we could put a few shell command and get the same effect:

https://github.com/jlumbroso/free-disk-space/blob/54081f138730dfa15788a46383842cd2f914a1be/action.yml#L149

https://github.com/jlumbroso/free-disk-space/blob/54081f138730dfa15788a46383842cd2f914a1be/action.yml#L161-L162

https://github.com/jlumbroso/free-disk-space/blob/54081f138730dfa15788a46383842cd2f914a1be/action.yml#L175-L185

I think the idea is to review the third-party script, and extract safe deletions to an in-house script. Filed #14803 to address this idea.

alamb

If we can get the tests clean I think this is good to go. I suspect we'll have to keep working at it to keep things stable

Thanks again @2010YOUY01

Fix extended test

15b995e

github-actions bot added development-process Related to development process of DataFusion core Core DataFusion crate labels Feb 18, 2025

Omega359 reviewed Feb 18, 2025

View reviewed changes

datafusion/core/tests/memory_limit/memory_limit_validation/sort_mem_validation.rs Outdated Show resolved Hide resolved

ozankabak reviewed Feb 19, 2025

View reviewed changes

2010YOUY01 and others added 3 commits February 19, 2025 14:54

feedback

8555d93

Update datafusion/core/tests/memory_limit/memory_limit_validation/sor…

8f0fe66

…t_mem_validation.rs Co-authored-by: Bruce Ritchie <[email protected]>

fix action version hash

af98850

alamb reviewed Feb 20, 2025

View reviewed changes

alamb approved these changes Feb 20, 2025

View reviewed changes

2010YOUY01 mentioned this pull request Feb 21, 2025

Inline disk cleanup script to replace third-party GitHub Action in CI #14803

Open

2010YOUY01 merged commit c92df4f into apache:main Feb 21, 2025
24 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix CI fail for extended test (by freeing up more disk space in CI runner) #14745

Fix CI fail for extended test (by freeing up more disk space in CI runner) #14745

2010YOUY01 commented Feb 18, 2025

Omega359 Feb 18, 2025

2010YOUY01 Feb 19, 2025

ozankabak Feb 19, 2025 •

edited

Loading

alamb Feb 20, 2025

alamb left a comment

alamb Feb 20, 2025

alamb Feb 20, 2025

2010YOUY01 Feb 21, 2025

alamb Feb 20, 2025

2010YOUY01 Feb 21, 2025

alamb left a comment

Fix CI fail for extended test (by freeing up more disk space in CI runner) #14745

Fix CI fail for extended test (by freeing up more disk space in CI runner) #14745

Conversation

2010YOUY01 commented Feb 18, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ozankabak Feb 19, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb left a comment

Choose a reason for hiding this comment

ozankabak Feb 19, 2025 •

edited

Loading