Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[gpu] strict driver and cuda version assignment (#1275)
* [gpu] toward a more consistent driver and CUDA install gpu/install_gpu_driver.sh * exclusively using .run file installation method when available * build nccl from source * cache build artifacts from kernel driver and nccl * Tested more CUDA minor versions * gathering CUDA and driver version from URLs if passed * Printing warnings when combination provided is known to fail * waiting on apt lock when it exists * wrapping expensive functions in completion checks to reduce re-run time * fixed a problem with ops agent not installing ; using venv * Installing gcc-12 on ubuntu22 to fix kernel driver FTBFS * setting better spark defaults * skipping proxy setup if http-proxy metadata not set * added function to check secure-boot and os version compatability gpu/manual-test-runner.sh * order commands correctly gpu/test_gpu.py * clearer test skipping logic * added instructions on how to test pyspark * correcting driver for cuda 12.4 * correcting cuda subversion. 12.4.0 instead of 12.4.1 so that driver and cuda match up * corrected cannonical 11.8 driver version ; removed extra code and comment ; added better description of what is in the runfile * skipping most tests ; using 11.7 from the cuda 11 line instead of the less well supported 11.8 * verified that the cuda and driver versions match up * reducing log capture * temporarily increasing machine shape for build caching * 64 is too many for a single T4 * added a subversion for 11.7 * add more tests to the install function * only including architectures supported by this version of CUDA * pinning down versions better ; more caching ; more ram disks ; new pytorch and tensorflow test functions * using maximum from 8.9 series on rocky for 11.7 * skip full build * pinning to bazel-7.4.0 * NCCL requires gcc-11 for cuda11 * rocky8 is now building from the source in the .run file * reverting to previous state of only selecting a compiler version on latest releases * replaced literal path names with variable values ; indexing builds by the signing key used * moved variable definition to prepare function ; moved driver signing to build phase * test whether variable is defined before checking its value * cache only the bins and logs * build index of kernel modules after unpacking ; remove call to non-existent function * only build module dependency index once * skipping CUDA 11 NCCL build on debian12 * skip cuda11 on debian12, rocky9 * renamed verify_pyspark to verify_instance_pyspark * failing somewhat gracefully ; skipping tests that would fail * skipping single node tests for rocky8 * re-enable other tests * Specifying bazel version with variable * fixing up some skip logic * replaced OS_NAME with _shortname * skip more single instance tests for rocky8 * fixing indentation ; skipping redundant test * remove retries of flakey tests * oops ; need to define the cuda version to test for * passing -q to gcloud to generate empty passphrase if no ssh key exists ; selecting a more modern version of the 550 driver * including instructions on how to create a secure-boot key pair * -e for expert, not -p for pro * updated 11.8 and 12.0 driver versions * added a signature check test which allows granular selection of platform to test, but does not yet verify signatures * tuning the layout of arguments to userspace.run * scoping DEFAULT_CUDA_VERSION correctly ; exercising rocky including kerberos on 12.6 * add a connect timeout to the ssh call instead of trying to patch around a longer than expected connection delay * add some entropy to the process * perhaps a re-run would have fixed 2.0-rocky8 on that last run * increasing init action timeout to account for uncached builds * cache non-open kernel build results * per-kernel sub-directory for kmod tarballs * using upstream repo and branch * corrected grammar error * testing Kerberos some more * better implementation of numa node selection * this time with a test which is exercised * skip debian11 on Kerberos * also skipping 2.1-ubuntu20 on kerberos clusters * re-adjusting tests to be performed ; adjusting rather than skipping known failure cases * more temporal variance * skipping CUDA=12.0 for ubuntu22 * kerberos not known to succeed on 2.0-rocky8 * 2.2 dataproc images do not support CUDA <= 12.0 * skipping SINGLE configuration for rocky8 again * not testing 2.0 * trying without test retries ; retries should happen within the test, not by re-running the test * kerberos only works on 2.2 * using expectedFailure instead of skipTest for tests which are known to fail * document one of the failure states * skipping expected failures * updated manual-test-runner.sh instructions * this one generated from template after refactor * do not point to local rpm pgp key * re-ordering to reduce delta from master * custom image usage can come later * see #1283 * replaced incorrectly removed presubmit.sh and removed custom image key creation script intended to be removed in 70f37b6 * revert nearly to master * can include extended test suite later * order commands correctly * placing all completion files in a common directory * extend supported version list to include latest release of each minor version and their associated driver * tested with CUDA 11.6.2/510.108.03 * nccl build completes successfully on debian10 * account for nvidia-smi ABI change post 11.6 * exercised with cuda 11.1 * cleaned up nccl build and pack code a bit * no longer installing cudnn from local debian repo * unpacking nccl from cache immediately rather than waiting until later in the code * determine cudnn version by what is available in the repo * less noise from apt-mark hold * nccl build tested on 11.1 and 11.6 * account for abi change in nvidia-smi * reverting cloudbuild/Dockerfile to master * nvidia is 404ing for download.nvidia.com ; using us.download.nvidia.com * skipping rocky9 * * adding version 12.6 to the support matrix * changing layout of gcs package folder * install_pytorch function created and called when cuDNN is being installed * incorrect version check removed * only install pytorch if include-pytorch metadata set to true * since call to install_pytorch is protected by metadata check, skip metadata check within the function ; create new function harden_sshd_config and call it * increasing timeout and machine shape to reduce no-cache build time * skip full test run due to edits to integration_tests directory * ubuntu18 does not know about kex-gss ; use correct driver version number for cuda 11.1.1 url generation * on rocky9 sshd service is called sshd instead of ssh as the rest of the platforms call it * kex-gss is new in debian11 * all rocky call it sshd it seems * cudnn no longer available on debian10 * compared with #1282 ; this change matches parity more closely * slightly better variable declaration ordering ; it is better still in the templates/ directory from #1282 * install spark rapids * cache the results of nvidia-smi --query-gpu * reduce development time * exercising more CUDA variants ; testing whether tests fail on long runs * try to reduce concurrent builds ; extend build time further ; only enable spark rapids on images >= 2.1 * fixed bug with spark rapids version assignment ; more conservative about requirements for ramdisk ; roll back spark.SQLPlugin change * * gpu does not work on capacity scheduler on dataproc 2.0 ; use fair * protect against race condition on removing the .building files * add logic for pre-11.7 cuda package repo back in * clean up and verify yarn config * revert test_install_gpu_cuda_nvidia_with_spark_job cuda versions * configure for use with JupyterLab * 2.2 should use 12.6.3 (latest) * Addressing review from cnauroth gpu/install_gpu_driver.sh: * use the same retry arguments in all calls to curl * correct 12.3's driver and sub-version * improve logic for pause as other workers perform build * remove call to undefined clear_nvsmi_cache * move closing "fi" to line of its own * added comments for unclear logic * removed commented code * remove unused curl for latest driver version gpu/test_gpu.py * removed excess test * added comment about numa node selection * removed skips of rocky9 ; 2.2.44-rocky9 build succeeds * reverting changes to presubmit.sh
- Loading branch information