[template] create templates for use in generating actions #1282
base: master
Conversation
templates/generate-action.pl (Outdated)
Hello @davorg - As I was reading through the literature, bringing myself back up to speed on the state of the art of template toolkits, I saw that there was a book with one of my friends' names on it that I had glanced at many times over the last couple of decades. I did not realize until just a few days ago that the dlc who was in charge of desk allocation in my cube farm when I started was the same dlc who wrote the book on this particular subject.
Anyway, I've been thinking of you and our peers as I've been hacking away at this installer. If you felt like looking things over and picking some nits, I'd love to hear your feedback. I hope your holidays are merry and all that!
@shlomif oh, hey I see that you are actively participating in Template.pm development. I'm not doing a lot with it in this repository; everything is pretty straightforward, I think. If you had some spare time to take a peek at the new templates/ directory in this repo, and especially templates/generate-action.pl, it might be fun to chat about it. I hope your holidays went well!
@cjac: hi! Where can I find the templates directory? Please give a URL.
@shlomif thanks for your prompt response! That URL is
https://github.com/LLC-Technologies-Collier/initialization-actions/tree/template-gpu-20241219
Force-pushed from 8d28938 to e511a6e
added some comments to address issues with documentation
templates/spark-rapids/mig.sh.in (Outdated)
# --metadata=ENABLE_MIG can be used to enable or disable MIG. The default is to enable it.
# The script does a reboot to fully enable MIG and then configures the MIG device based on the
# user-specified MIG_CGI profiles specified via: --metadata=^:^MIG_CGI='9,9'. If MIG_CGI
# is not specified it assumes it's using an A100 and configures 2 instances with profile id 9.
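As a hedged illustration of the MIG_CGI convention described in the header above: the comma-separated profile list and the two-instances-of-profile-9 default come from the quoted comment, but the helper function below is hypothetical, not code from mig.sh.in.

```shell
#!/usr/bin/env bash
# Sketch only: parse_mig_cgi is a hypothetical helper. It splits a
# comma-separated MIG_CGI metadata value (e.g. passed via
# --metadata=^:^MIG_CGI='9,9') into one profile id per line, defaulting
# to two instances of profile id 9 as the header describes.
parse_mig_cgi() {
  local cgi="${1:-9,9}"
  local IFS=','
  local -a profiles
  read -r -a profiles <<< "${cgi}"
  printf '%s\n' "${profiles[@]}"
}

parse_mig_cgi "9,9"   # one profile id per line
```

A real startup script would then feed each profile id to `nvidia-smi mig` instance creation; that step is omitted here since it requires MIG-capable hardware.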
s/A100/H100/
templates/spark-rapids/mig.sh.in (Outdated)
#
# This script should be specified in --metadata=startup-script-url= option and
# --metadata=ENABLE_MIG can be used to enable or disable MIG. The default is to enable it.
# The script does a reboot to fully enable MIG and then configures the MIG device based on the
does not ever reboot, and neither should you
using the test suite I just cleaned up for #1275
2.1-debian11 failure:
well that's good news, then.
* include version in action generator
Force-pushed from 763e1ff to 900c10a
templates/generate-action.pl (Outdated)
@dlc - Can I get a review of the templates/ directory in this repository, please? I tried to keep it simple for the initial implementation, but if you have any advice about how we can further reduce duplication, I'd be all ears. I'm thinking about picking up your book and getting into the minutiae, but I hope the PR will be closed long before then!
Force-pushed from cea2aa3 to 2afff45
Force-pushed from 2afff45 to aa792c3
Those instructions seem right, but it may take less time with the new versions, since much of the work is now cached and, when memory is sufficient, installation uses RAM disks. Maybe mention that with new custom images, Secure Boot can be enabled. The new custom image script requires that the Secret Manager API service be enabled for the project. And yes, new actions will be generated from templates on each, now versioned, release.
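As a hedged sketch of the Secret Manager prerequisite mentioned above: a pre-flight check could verify the API appears among the project's enabled services. The helper below is hypothetical and the service list is stubbed; in a real project it would come from `gcloud services list --enabled`.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: confirm a required API service appears
# in an enabled-services list. The list is stubbed here rather than
# fetched with `gcloud services list --enabled`.
service_enabled() {
  local svc="$1" enabled="$2"
  grep -qx "${svc}" <<< "${enabled}"
}

enabled_services="$(printf '%s\n' compute.googleapis.com secretmanager.googleapis.com)"
if service_enabled secretmanager.googleapis.com "${enabled_services}"; then
  echo "secretmanager enabled"
fi
```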
Force-pushed from effd8b5 to 374ff96
Force-pushed from 76a6df6 to 0c3eb51
… the templates/ directory from GoogleCloudDataproc#1282
TODO: apply these changes from the gpu driver installer into the common templates:

* [gpu] toward a more consistent driver and CUDA install

gpu/install_gpu_driver.sh
* exclusively using .run file installation method when available
* build nccl from source
* cache build artifacts from kernel driver and nccl
* Tested more CUDA minor versions
* gathering CUDA and driver version from URLs if passed
* Printing warnings when combination provided is known to fail
* waiting on apt lock when it exists
* wrapping expensive functions in completion checks to reduce re-run time
* fixed a problem with ops agent not installing ; using venv
* Installing gcc-12 on ubuntu22 to fix kernel driver FTBFS
* setting better spark defaults
* skipping proxy setup if http-proxy metadata not set
* added function to check secure-boot and os version compatibility

gpu/manual-test-runner.sh
* order commands correctly

gpu/test_gpu.py
* clearer test skipping logic
* added instructions on how to test pyspark
* correcting driver for cuda 12.4
* correcting cuda subversion: 12.4.0 instead of 12.4.1 so that driver and cuda match up
* corrected canonical 11.8 driver version ; removed extra code and comment ; added better description of what is in the runfile
* skipping most tests ; using 11.7 from the cuda 11 line instead of the less well supported 11.8
* verified that the cuda and driver versions match up
* reducing log capture
* temporarily increasing machine shape for build caching
* 64 is too many for a single T4
* added a subversion for 11.7
* add more tests to the install function
* only including architectures supported by this version of CUDA
* pinning down versions better ; more caching ; more ram disks ; new pytorch and tensorflow test functions
* using maximum from 8.9 series on rocky for 11.7
* skip full build
* pinning to bazel-7.4.0
* NCCL requires gcc-11 for cuda11
* rocky8 is now building from the source in the .run file
* reverting to previous state of only selecting a compiler version on latest releases
* replaced literal path names with variable values ; indexing builds by the signing key used
* moved variable definition to prepare function ; moved driver signing to build phase
* test whether variable is defined before checking its value
* cache only the bins and logs
* build index of kernel modules after unpacking ; remove call to non-existent function
* only build module dependency index once
* skipping CUDA 11 NCCL build on debian12
* skip cuda11 on debian12, rocky9
* renamed verify_pyspark to verify_instance_pyspark
* failing somewhat gracefully ; skipping tests that would fail
* skipping single node tests for rocky8
* re-enable other tests
* Specifying bazel version with variable
* fixing up some skip logic
* replaced OS_NAME with _shortname
* skip more single instance tests for rocky8
* fixing indentation ; skipping redundant test
* remove retries of flaky tests
* oops ; need to define the cuda version to test for
* passing -q to gcloud to generate empty passphrase if no ssh key exists ; selecting a more modern version of the 550 driver
* including instructions on how to create a secure-boot key pair
* -e for expert, not -p for pro
* updated 11.8 and 12.0 driver versions
* added a signature check test which allows granular selection of platform to test, but does not yet verify signatures
* tuning the layout of arguments to userspace.run
* scoping DEFAULT_CUDA_VERSION correctly ; exercising rocky including kerberos on 12.6
* add a connect timeout to the ssh call instead of trying to patch around a longer than expected connection delay
* add some entropy to the process
* perhaps a re-run would have fixed 2.0-rocky8 on that last run
* increasing init action timeout to account for uncached builds
* cache non-open kernel build results
* per-kernel sub-directory for kmod tarballs
* using upstream repo and branch
* corrected grammar error
* testing Kerberos some more
* better implementation of numa node selection
* this time with a test which is exercised
* skip debian11 on Kerberos
* also skipping 2.1-ubuntu20 on kerberos clusters
* re-adjusting tests to be performed ; adjusting rather than skipping known failure cases
* more temporal variance
* skipping CUDA=12.0 for ubuntu22
* kerberos not known to succeed on 2.0-rocky8
* 2.2 dataproc images do not support CUDA <= 12.0
* skipping SINGLE configuration for rocky8 again
* not testing 2.0
* trying without test retries ; retries should happen within the test, not by re-running the test
* kerberos only works on 2.2
* using expectedFailure instead of skipTest for tests which are known to fail
* document one of the failure states
* skipping expected failures
* updated manual-test-runner.sh instructions
* this one generated from template after refactor
* do not point to local rpm pgp key
* re-ordering to reduce delta from master
* custom image usage can come later
* see #1283
* replaced incorrectly removed presubmit.sh and removed custom image key creation script intended to be removed in 70f37b6
* revert nearly to master
* can include extended test suite later
* order commands correctly
* placing all completion files in a common directory
* extend supported version list to include latest release of each minor version and their associated driver
* tested with CUDA 11.6.2/510.108.03
* nccl build completes successfully on debian10
* account for nvidia-smi ABI change post 11.6
* exercised with cuda 11.1
* cleaned up nccl build and pack code a bit
* no longer installing cudnn from local debian repo
* unpacking nccl from cache immediately rather than waiting until later in the code
* determine cudnn version by what is available in the repo
* less noise from apt-mark hold
* nccl build tested on 11.1 and 11.6
* account for abi change in nvidia-smi
* reverting cloudbuild/Dockerfile to master
* nvidia is 404ing for download.nvidia.com ; using us.download.nvidia.com
* skipping rocky9
* adding version 12.6 to the support matrix
* changing layout of gcs package folder
* install_pytorch function created and called when cuDNN is being installed
* incorrect version check removed
* only install pytorch if include-pytorch metadata set to true
* since call to install_pytorch is protected by metadata check, skip metadata check within the function ; create new function harden_sshd_config and call it
* increasing timeout and machine shape to reduce no-cache build time
* skip full test run due to edits to integration_tests directory
* ubuntu18 does not know about kex-gss ; use correct driver version number for cuda 11.1.1 url generation
* on rocky9 sshd service is called sshd instead of ssh as the rest of the platforms call it
* kex-gss is new in debian11
* all rocky call it sshd it seems
* cudnn no longer available on debian10
* compared with #1282 ; this change matches parity more closely
* slightly better variable declaration ordering ; it is better still in the templates/ directory from #1282
* install spark rapids
* cache the results of nvidia-smi --query-gpu
* reduce development time
* exercising more CUDA variants ; testing whether tests fail on long runs
* try to reduce concurrent builds ; extend build time further ; only enable spark rapids on images >= 2.1
* fixed bug with spark rapids version assignment ; more conservative about requirements for ramdisk ; roll back spark.SQLPlugin change
* gpu does not work on capacity scheduler on dataproc 2.0 ; use fair
* protect against race condition on removing the .building files
* add logic for pre-11.7 cuda package repo back in
* clean up and verify yarn config
* revert test_install_gpu_cuda_nvidia_with_spark_job cuda versions
* configure for use with JupyterLab
* 2.2 should use 12.6.3 (latest)
* Addressing review from cnauroth

gpu/install_gpu_driver.sh:
* use the same retry arguments in all calls to curl
* correct 12.3's driver and sub-version
* improve logic for pause as other workers perform build
* remove call to undefined clear_nvsmi_cache
* move closing "fi" to line of its own
* added comments for unclear logic
* removed commented code
* remove unused curl for latest driver version

gpu/test_gpu.py
* removed excess test
* added comment about numa node selection
* removed skips of rocky9 ; 2.2.44-rocky9 build succeeds
* reverting changes to presubmit.sh
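Several of the bullets above mention wrapping expensive functions in completion checks and placing all completion files in a common directory. A minimal sketch of that idiom follows; the function name, directory, and step names are illustrative assumptions, not the installer's actual code.

```shell
#!/usr/bin/env bash
# Sketch of the "completion check" idiom: each expensive step records a
# marker file in a shared directory when it succeeds, and is skipped on
# subsequent runs, reducing re-run time. Names here are hypothetical.
COMPLETION_DIR="$(mktemp -d)"

run_once() {
  local step="$1"; shift
  local marker="${COMPLETION_DIR}/${step}.complete"
  if [ -f "${marker}" ]; then return 0; fi   # already done; skip
  "$@" && touch "${marker}"                  # mark complete on success
}

run_once build-nccl echo "building nccl"   # runs the command
run_once build-nccl echo "building nccl"   # no-op on re-run
```

Because the marker is only touched when the wrapped command succeeds, a failed step is retried on the next run instead of being silently skipped.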
This PR should resolve #1276 and is an attempt to better solve the problem space of #1030.
I believe that #1259 could be implemented more easily using this change, but its dependency on rebooting is antithetical to Dataproc in many ways and has not been included. I will meet with NVIDIA and the Dataproc engineering team to troubleshoot the problem.
This PR includes code refactored out of the GPU-acceleration-related and Dask-related actions into files under the templates/ directory of the repository. There is a set of PRs that rebase onto this branch: