-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix CI fail for extended test (by freeing up more disk space in CI runner) #14745
Conversation
.github/workflows/extended.yml
Outdated
submodules: true | ||
fetch-depth: 1 | ||
- name: Free Disk Space (Ubuntu) | ||
uses: jlumbroso/[email protected] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may possibly be a concern - the apache org limits what 3rd party actions are allowed though I am unsure if there is a list of them. I know I've encountered actions that were not allowed in the past. If it is a blocker you may have to request an exemption from the apache devops people. I think they also may want to use a git hash vs a version # to be sure it doesn't change underneath us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you. I changed the version number to hash.
Regarding whether this action is in the approved list, this action is able to run in my cloned branch, so I guess it's allowed.
datafusion/core/tests/memory_limit/memory_limit_validation/sort_mem_validation.rs
Outdated
Show resolved
Hide resolved
@@ -95,7 +106,7 @@ jobs: | |||
- name: Run tests | |||
run: | | |||
cd datafusion | |||
cargo test --profile ci --exclude datafusion-examples --exclude datafusion-benchmarks --exclude datafusion-sqllogictest --workspace --lib --tests --features=force_hash_collisions,avro,extended_tests | |||
cargo test --profile ci --exclude datafusion-examples --exclude datafusion-benchmarks --exclude datafusion-sqllogictest --workspace --lib --tests --features=force_hash_collisions,avro |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why will we not run extended tests with hash collisions? We may lose some coverage here. If the only problematic thing is the memory-limited tests, we should place it under a separate flag and only skip those tests.
What I wrote above seems wrong - I did a search and the extended_tests
seems to be only gating these memory tests. If that is the case, maybe the name of the flag is not that great -- it led me to think we are disabling a bunch of other tests in addition to these ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree the fact that the flag extended_tests
and the workflow is named extended
is quite confusing
Maybe as a follow on PR we can rename the extended_test
flag somthing different like extra_tests
or extended_suite
🤔
…t_mem_validation.rs Co-authored-by: Bruce Ritchie <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @2010YOUY01 -- the idea of cleaning packages from the runners makes a lot of sense to me as does running the suite on its own though I am worried about the use of a third party github action
I don't understand the init_once
function
@@ -95,7 +106,7 @@ jobs: | |||
- name: Run tests | |||
run: | | |||
cd datafusion | |||
cargo test --profile ci --exclude datafusion-examples --exclude datafusion-benchmarks --exclude datafusion-sqllogictest --workspace --lib --tests --features=force_hash_collisions,avro,extended_tests | |||
cargo test --profile ci --exclude datafusion-examples --exclude datafusion-benchmarks --exclude datafusion-sqllogictest --workspace --lib --tests --features=force_hash_collisions,avro |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree the fact that the flag extended_tests
and the workflow is named extended
is quite confusing
Maybe as a follow on PR we can rename the extended_test
flag somthing different like extra_tests
or extended_suite
🤔
/// | ||
/// TODO: This is a hack, can be cleaned up if we have a better way to let multiple | ||
/// test cases run in different processes (instead of different threads by default) | ||
fn init_once() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't undersrtand how this avoids recompilation
It seems like recompilation would happen if the options / features were different
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe there are some subtle differences between the outer/inner test command to cause the recompilation, I tried to match the option and features but still can't work.
This init_once
is to solve the issue that different inner test commands can trigger multiple recompilation, I think there is no existing mechanism to coordinate parallel compilation, so this is just a lock to coordinate. Once the first inner test is compiled, other inner tests can use the same test binary without recompilation.
submodules: true | ||
fetch-depth: 1 | ||
- name: Free Disk Space (Ubuntu) | ||
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a little worried about using a third-party action
It seems from the soure we could put a few shell command and get the same effect:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the idea is to review the third-party script, and extract safe deletions to an in-house script. Filed #14803 to address this idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we can get the tests clean I think this is good to go. I suspect we'll have to keep working at it to keep things stable
Thanks again @2010YOUY01
Which issue does this PR close?
Rationale for this change
CI for extended tests are failing because there is not enough disk space for CI runners, test binaries take lots of disk space.
This PR frees up more disk space on CI runner (12GB -> 45GB).
Before, after setting up Rust dependencies, there is only 12GB disk space left to run the tests, I think this can run into trouble if we want extended tests to do more stuff in the future.
What changes are included in this PR?
Memory validation test causes recompilations (I didn't find a direct fix 🤦🏼 ), and it caused large disk space usage. Before it can cause multiple recompilations (if we add more similar test cases, disk usage can continue grow).
This PR did uses a singleton (
init_once
) to make sure recompilation happens exactly once, and the disk usage won't grow indefinitely.Are these changes tested?
I tested on my local clone, and it works
https://github.com/2010YOUY01/arrow-datafusion/actions/runs/13387132954/job/37386562838
Are there any user-facing changes?
No