Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix CI fail for extended test (by freeing up more disk space in CI runner) #14745

Merged
merged 4 commits into from
Feb 21, 2025

Conversation

2010YOUY01
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

CI for extended tests are failing because there is not enough disk space for CI runners, test binaries take lots of disk space.

This PR frees up more disk space on CI runner (12GB -> 45GB).
Before, after setting up Rust dependencies, there is only 12GB disk space left to run the tests, I think this can run into trouble if we want extended tests to do more stuff in the future.

What changes are included in this PR?

  1. CI change
  • Directly running test on Github CI runner, instead of Docker container, and manually setup Rust dependencies.
  • Applied Github Action https://github.com/jlumbroso/free-disk-space to delete unnecessary files on CI runner.
  1. Test change
    Memory validation test causes recompilations (I didn't find a direct fix 🤦🏼 ), and it caused large disk space usage. Before it can cause multiple recompilations (if we add more similar test cases, disk usage can continue grow).
    This PR did uses a singleton (init_once) to make sure recompilation happens exactly once, and the disk usage won't grow indefinitely.

Are these changes tested?

I tested on my local clone, and it works
https://github.com/2010YOUY01/arrow-datafusion/actions/runs/13387132954/job/37386562838

Are there any user-facing changes?

No

@github-actions github-actions bot added development-process Related to development process of DataFusion core Core DataFusion crate labels Feb 18, 2025
submodules: true
fetch-depth: 1
- name: Free Disk Space (Ubuntu)
uses: jlumbroso/[email protected]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may possibly be a concern - the apache org limits what 3rd party actions are allowed though I am unsure if there is a list of them. I know I've encountered actions that were not allowed in the past. If it is a blocker you may have to request an exemption from the apache devops people. I think they also may want to use a git hash vs a version # to be sure it doesn't change underneath us.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. I changed the version number to hash.
Regarding whether this action is in the approved list, this action is able to run in my cloned branch, so I guess it's allowed.

@@ -95,7 +106,7 @@ jobs:
- name: Run tests
run: |
cd datafusion
cargo test --profile ci --exclude datafusion-examples --exclude datafusion-benchmarks --exclude datafusion-sqllogictest --workspace --lib --tests --features=force_hash_collisions,avro,extended_tests
cargo test --profile ci --exclude datafusion-examples --exclude datafusion-benchmarks --exclude datafusion-sqllogictest --workspace --lib --tests --features=force_hash_collisions,avro
Copy link
Contributor

@ozankabak ozankabak Feb 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why will we not run extended tests with hash collisions? We may lose some coverage here. If the only problematic thing is the memory-limited tests, we should place it under a separate flag and only skip those tests.

What I wrote above seems wrong - I did a search and the extended_tests seems to be only gating these memory tests. If that is the case, maybe the name of the flag is not that great -- it led me to think we are disabling a bunch of other tests in addition to these ones.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree the fact that the flag extended_tests and the workflow is named extended is quite confusing

Maybe as a follow on PR we can rename the extended_test flag somthing different like extra_tests or extended_suite 🤔

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @2010YOUY01 -- the idea of cleaning packages from the runners makes a lot of sense to me as does running the suite on its own though I am worried about the use of a third party github action

I don't understand the init_once function

@@ -95,7 +106,7 @@ jobs:
- name: Run tests
run: |
cd datafusion
cargo test --profile ci --exclude datafusion-examples --exclude datafusion-benchmarks --exclude datafusion-sqllogictest --workspace --lib --tests --features=force_hash_collisions,avro,extended_tests
cargo test --profile ci --exclude datafusion-examples --exclude datafusion-benchmarks --exclude datafusion-sqllogictest --workspace --lib --tests --features=force_hash_collisions,avro
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree the fact that the flag extended_tests and the workflow is named extended is quite confusing

Maybe as a follow on PR we can rename the extended_test flag somthing different like extra_tests or extended_suite 🤔

///
/// TODO: This is a hack, can be cleaned up if we have a better way to let multiple
/// test cases run in different processes (instead of different threads by default)
fn init_once() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't undersrtand how this avoids recompilation

It seems like recompilation would happen if the options / features were different

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe there are some subtle differences between the outer/inner test command to cause the recompilation, I tried to match the option and features but still can't work.
This init_once is to solve the issue that different inner test commands can trigger multiple recompilation, I think there is no existing mechanism to coordinate parallel compilation, so this is just a lock to coordinate. Once the first inner test is compiled, other inner tests can use the same test binary without recompilation.

submodules: true
fetch-depth: 1
- name: Free Disk Space (Ubuntu)
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the idea is to review the third-party script, and extract safe deletions to an in-house script. Filed #14803 to address this idea.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can get the tests clean I think this is good to go. I suspect we'll have to keep working at it to keep things stable

Thanks again @2010YOUY01

@2010YOUY01 2010YOUY01 merged commit c92df4f into apache:main Feb 21, 2025
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate development-process Related to development process of DataFusion
Projects
None yet
Development

Successfully merging this pull request may close these issues.

extended_test (with memory limit tracking) are commented out
4 participants