[gpu] strict driver and cuda version assignment (#1275)
* [gpu] toward a more consistent driver and CUDA install

gpu/install_gpu_driver.sh
  * exclusively using .run file installation method when available
  * build nccl from source
  * cache build artifacts from kernel driver and nccl
  * Tested more CUDA minor versions
  * gathering CUDA and driver version from URLs if passed
  * Printing warnings when combination provided is known to fail
  * waiting on apt lock when it exists
  * wrapping expensive functions in completion checks to reduce re-run time
  * fixed a problem with ops agent not installing ; using venv
  * Installing gcc-12 on ubuntu22 to fix kernel driver FTBFS
  * setting better spark defaults
  * skipping proxy setup if http-proxy metadata not set
  * added function to check secure-boot and os version compatibility
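
The completion-check idea from the list above can be sketched roughly as follows. This is a minimal illustration, not the code in gpu/install_gpu_driver.sh: `run_once`, `COMPLETE_DIR`, and `expensive_build` are hypothetical names, and the marker-file location is an assumption.

```bash
# Hypothetical sketch: skip an expensive step when a marker file shows it already ran.
# COMPLETE_DIR, run_once and expensive_build are illustrative names only.
COMPLETE_DIR="$(mktemp -d)"   # assumed location; a real script would use a persistent path

# run_once NAME CMD [ARGS...]
# Runs CMD only when no marker for NAME exists; writes the marker on success.
run_once() {
  local name="$1"
  shift
  if [ -f "${COMPLETE_DIR}/${name}" ]; then
    echo "skipping ${name}: already complete"
    return 0
  fi
  "$@" || return $?
  touch "${COMPLETE_DIR}/${name}"
}

expensive_build() { echo "building"; }   # stand-in for a slow build step

run_once build expensive_build   # runs the step
run_once build expensive_build   # second call is skipped
```

Because the marker is only written after the command succeeds, a failed step is retried on the next run rather than being recorded as done.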

gpu/manual-test-runner.sh
  * order commands correctly

gpu/test_gpu.py
  * clearer test skipping logic
  * added instructions on how to test pyspark

* correcting driver for cuda 12.4

* correcting cuda subversion.  12.4.0 instead of 12.4.1 so that driver and cuda match up

* corrected canonical 11.8 driver version ; removed extra code and comment ; added better description of what is in the runfile

* skipping most tests ; using 11.7 from the cuda 11 line instead of the less well supported 11.8

* verified that the cuda and driver versions match up

* reducing log capture

* temporarily increasing machine shape for build caching

* 64 is too many for a single T4

* added a subversion for 11.7

* add more tests to the install function

* only including architectures supported by this version of CUDA

* pinning down versions better ; more caching ; more ram disks ; new pytorch and tensorflow test functions

* using maximum from 8.9 series on rocky for 11.7

* skip full build

* pinning to bazel-7.4.0

* NCCL requires gcc-11 for cuda11

* rocky8 is now building from the source in the .run file

* reverting to previous state of only selecting a compiler version on latest releases

* replaced literal path names with variable values ; indexing builds by the signing key used

* moved variable definition to prepare function ; moved driver signing to build phase

* test whether variable is defined before checking its value

* cache only the bins and logs

* build index of kernel modules after unpacking ; remove call to non-existent function

* only build module dependency index once

* skipping CUDA 11 NCCL build on debian12

* skip cuda11 on debian12, rocky9

* renamed verify_pyspark to verify_instance_pyspark

* failing somewhat gracefully ; skipping tests that would fail

* skipping single node tests for rocky8

* re-enable other tests

* Specifying bazel version with variable

* fixing up some skip logic

* replaced OS_NAME with _shortname

* skip more single instance tests for rocky8

* fixing indentation ; skipping redundant test

* remove retries of flakey tests

* oops ; need to define the cuda version to test for

* passing -q to gcloud to generate empty passphrase if no ssh key exists ; selecting a more modern version of the 550 driver

* including instructions on how to create a secure-boot key pair

* -e for expert, not -p for pro

* updated 11.8 and 12.0 driver versions

* added a signature check test which allows granular selection of platform to test, but does not yet verify signatures

* tuning the layout of arguments to userspace.run

* scoping DEFAULT_CUDA_VERSION correctly ; exercising rocky including kerberos on 12.6

* add a connect timeout to the ssh call instead of trying to patch around a longer than expected connection delay

* add some entropy to the process

* perhaps a re-run would have fixed 2.0-rocky8 on that last run

* increasing init action timeout to account for uncached builds

* cache non-open kernel build results

* per-kernel sub-directory for kmod tarballs

* using upstream repo and branch

* corrected grammar error

* testing Kerberos some more

* better implementation of numa node selection

* this time with a test which is exercised

* skip debian11 on Kerberos

* also skipping 2.1-ubuntu20 on kerberos clusters

* re-adjusting tests to be performed ; adjusting rather than skipping known failure cases

* more temporal variance

* skipping CUDA=12.0 for ubuntu22

* kerberos not known to succeed on 2.0-rocky8

* 2.2 dataproc images do not support CUDA <= 12.0

* skipping SINGLE configuration for rocky8 again

* not testing 2.0

* trying without test retries ; retries should happen within the test, not by re-running the test

* kerberos only works on 2.2

* using expectedFailure instead of skipTest for tests which are known to fail

* document one of the failure states

* skipping expected failures

* updated manual-test-runner.sh instructions

* this one generated from template after refactor

* do not point to local rpm pgp key

* re-ordering to reduce delta from master

* custom image usage can come later

* see #1283

* replaced incorrectly removed presubmit.sh and removed custom image key creation script intended to be removed in 70f37b6

* revert nearly to master

* can include extended test suite later

* order commands correctly

* placing all completion files in a common directory

* extend supported version list to include latest release of each minor version and their associated driver

* tested with CUDA 11.6.2/510.108.03

* nccl build completes successfully on debian10

* account for nvidia-smi ABI change post 11.6

* exercised with cuda 11.1

* cleaned up nccl build and pack code a bit
* no longer installing cudnn from local debian repo
* unpacking nccl from cache immediately rather than waiting until
  later in the code
* determine cudnn version by what is available in the repo
* less noise from apt-mark hold
* nccl build tested on 11.1 and 11.6
* account for abi change in nvidia-smi

* reverting cloudbuild/Dockerfile to master

* nvidia is 404ing for download.nvidia.com ; using us.download.nvidia.com

* skipping rocky9

* adding version 12.6 to the support matrix
* changing layout of gcs package folder
* install_pytorch function created and called when cuDNN is being installed

* incorrect version check removed

* only install pytorch if include-pytorch metadata set to true

* since call to install_pytorch is protected by metadata check, skip metadata check within the function ; create new function harden_sshd_config and call it

* increasing timeout and machine shape to reduce no-cache build time

* skip full test run due to edits to integration_tests directory

* ubuntu18 does not know about kex-gss ; use correct driver version number for cuda 11.1.1 url generation

* on rocky9 sshd service is called sshd instead of ssh as the rest of the platforms call it

* kex-gss is new in debian11

* all rocky call it sshd it seems
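
The service-name difference noted in the commits above could be handled with a small lookup like the one below. This is a sketch only: `sshd_service_name` is a hypothetical helper, and it assumes the caller passes an OS id such as the `ID` field of /etc/os-release.

```bash
# Hypothetical helper: map an OS id (e.g. the ID field of /etc/os-release)
# to the name of the SSH daemon's service unit.
sshd_service_name() {
  case "$1" in
    rocky) echo "sshd" ;;   # per the commits above, Rocky names the unit sshd
    *)     echo "ssh"  ;;   # the other platforms exercised here use ssh
  esac
}

sshd_service_name rocky
sshd_service_name debian
```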

* cudnn no longer available on debian10

* compared with #1282 ; this change matches parity more closely

* slightly better variable declaration ordering ; it is better still in the templates/ directory from #1282

* install spark rapids

* cache the results of nvidia-smi --query-gpu

* reduce development time

* exercising more CUDA variants ; testing whether tests fail on long runs

* try to reduce concurrent builds ; extend build time further ; only enable spark rapids on images >= 2.1

* fixed bug with spark rapids version assignment ; more conservative about requirements for ramdisk ; roll back spark.SQLPlugin change

* gpu does not work on capacity scheduler on dataproc 2.0 ; use fair
* protect against race condition on removing the .building files
* add logic for pre-11.7 cuda package repo back in
* clean up and verify yarn config

* revert test_install_gpu_cuda_nvidia_with_spark_job cuda versions

* configure for use with JupyterLab

* 2.2 should use 12.6.3 (latest)

* Addressing review from cnauroth

gpu/install_gpu_driver.sh:
* use the same retry arguments in all calls to curl
* correct 12.3's driver and sub-version
* improve logic for pause as other workers perform build
* remove call to undefined clear_nvsmi_cache
* move closing "fi" to line of its own
* added comments for unclear logic
* removed commented code
* remove unused curl for latest driver version
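
One way to keep the retry arguments identical across every curl call is to route them through a single wrapper. The sketch below assumes a hypothetical `curl_retry` function; the flag values are illustrative, not necessarily the ones used in the script.

```bash
# Hypothetical wrapper so every download shares one set of retry flags.
# The flag values below are illustrative assumptions.
curl_retry() {
  curl -fsSL --retry 10 --retry-delay 5 --retry-connrefused "$@"
}

# Usage (not executed here):
#   curl_retry -o /tmp/file.run "https://example.com/file.run"
```

Changing the retry policy then means editing one line instead of hunting down each call site.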

gpu/test_gpu.py
* removed excess test
* added comment about numa node selection
* removed skips of rocky9 ; 2.2.44-rocky9 build succeeds

* reverting changes to presubmit.sh
cjac authored Feb 6, 2025
1 parent 0b3165c commit 87e2f91
Showing 6 changed files with 1,527 additions and 511 deletions.
14 changes: 10 additions & 4 deletions gpu/Dockerfile
@@ -15,19 +15,25 @@ RUN apt-get -qq update \
 curl jq less screen > /dev/null 2>&1 && apt-get clean

 # Install bazel signing key, repo and package
-ENV bazel_kr_path=/usr/share/keyrings/bazel-release.pub.gpg
-ENV bazel_repo_data="http://storage.googleapis.com/bazel-apt stable jdk1.8"
+ENV bazel_kr_path=/usr/share/keyrings/bazel-keyring.gpg \
+bazel_version=7.4.0 \
+bazel_repo_data="http://storage.googleapis.com/bazel-apt stable jdk1.8" \
+DEBIAN_FRONTEND=noninteractive

 RUN /usr/bin/curl -s https://bazel.build/bazel-release.pub.gpg \
 | gpg --dearmor -o "${bazel_kr_path}" \
 && echo "deb [arch=amd64 signed-by=${bazel_kr_path}] ${bazel_repo_data}" \
 | dd of=/etc/apt/sources.list.d/bazel.list status=none \
 && apt-get update -qq

-RUN apt-get autoremove -y -qq && \
-apt-get install -y -qq default-jdk python3-setuptools bazel > /dev/null 2>&1 && \
+RUN apt-get autoremove -y -qq > /dev/null 2>&1 && \
+apt-get install -y -qq default-jdk python3-setuptools bazel-${bazel_version} > /dev/null 2>&1 && \
 apt-get clean

+# Set bazel-${bazel_version} as the default bazel alternative in this container
+RUN update-alternatives --install /usr/bin/bazel bazel /usr/bin/bazel-${bazel_version} 1 && \
+update-alternatives --set bazel /usr/bin/bazel-${bazel_version}
+
 # Install here any utilities you find useful when troubleshooting
 RUN apt-get -y -qq install emacs-nox vim uuid-runtime > /dev/null 2>&1 && apt-get clean
