[template] create templates for use in generating actions #1282
Draft: cjac wants to merge 131 commits into GoogleCloudDataproc:master from LLC-Technologies-Collier:template-gpu-20241219
4f49f65
[template] generate gpu/install_gpu_driver.sh from templates
cjac 1dae02b
new hold nvidia packages function ; moved variable definition around …
cjac e97e376
added two new gpu functions: configure_mig_cgi and enable_mig
cjac 310bb9d
templatized version of mig.sh
cjac 912ebe7
comment fix-up
cjac 87965de
nvidia-container-toolkit repo setup changes are working on rocky8
cjac 93fe4cc
defining variables in the generator script instead of duplicating in …
cjac b82aadc
tested with debian12
cjac dd98436
tested on 8x H100s with bookworm
cjac b4dabad
created and called function enable_and_configure_mig
cjac edeab28
moved comment to correct function
cjac 0e8946c
do not point to local rpm pgp key
cjac c44195a
store completion signal files in their own directory
cjac 31d1a9e
excessive sudo
cjac 4a3a8cd
install spark rapids in all cases
cjac 6ab36a5
merged spark-rapids functions into general gpu util_functions template
cjac ef36694
correcting variable name
cjac af69141
using new function name
cjac d59d5e6
driver version for 12.4.0 had not been tested in a while and had beco…
cjac 8a9e00a
expanding non-default version tests ; adding utility function to veri…
cjac 1113855
reduced boot disk size to 50GB
cjac 7034739
skipping old cuda on new images ; sizing instances to build
cjac 2873f49
skipping older debuntu when cuda version not specified
cjac 576b32f
refactor into functions
cjac b03dc57
moved secure-boot utility functions and common environment setup into…
cjac bf98d85
refactored exit_handler
cjac 4320953
declaring constants prior to running functions
cjac 2b0947b
removed old variables, included a current one which does not get exer…
cjac 5dbc1f2
do not break if variable undefined
cjac c5d46d3
order of operations error fixed with parentheses.
cjac 7be62b3
using lower xgboost version for older dataproc images
cjac 41c327a
test whether the variable is defined before testing its value
cjac b70477b
refactor the xgboost installer a little
cjac 073ed1f
only minor changes
cjac 4f66a51
explicitly notifying at the completion of the main function
cjac f8a9b7d
moved trap outside of the template
cjac 19520b4
stop / start instead of restart
cjac f659ec5
skipping install on gpu-less systems more quickly
cjac af817f0
install_dependencies is called from base template prep function
cjac 2e7441b
re-thought about the dependencies install time
cjac 29631a0
refactored configure_gpu_exclusive_mode to fewer lines
cjac 7ea7653
refactored gpu-related code out of common function library ; less rea…
cjac 70349a6
being more surgical about signing material usage
cjac b5473c5
removed dependency on pciutils ; defined is_debuntu with other os com…
cjac be3dbf6
again I meant elif
cjac 3ca8c91
fall back on metadata value if modulus_md5sum variable undefined
cjac 83d5ccc
switch to other build_dir variable assignment
cjac 93fdb30
parens
cjac 917f4b6
allow failure when grepping PCI devices for 10DE
cjac 668db72
removed listing of nodes_include ; does not work in custom-images con…
cjac 2193c28
min spark version supported by newer rapids is insufficient ; xgboost…
cjac 992d83a
skipping fewer tests
cjac 7170872
simplified rapids / xgboost default version logic
cjac b33cb27
ubuntu sometimes takes a while to bring gcloud online
cjac c56440a
only using 24.08.1 on 2.2 images ; fix a typo in a comment
cjac f10df49
refactored ; these files should be quite similar now
cjac f3a103e
returning spark-rapids/* to master ; this version of these templates …
cjac 0ac57a0
return test suite to master
cjac b3e5618
do not run all tests ; also do not retry failures
cjac b4e99ee
expanding non-default version tests ; adding utility function to veri…
cjac e4eab7b
reverting to master
cjac 95b17ac
reverting test_spark-rapids.py to master
cjac 212b9af
do not consider templates as changed files
cjac e9b9e5d
using nvsmi for some error protection
cjac adf4312
corrected comments
cjac dfcd8b0
defining xpath variables as local
cjac 9bb4d66
tested on 2.1-ubuntu20
cjac 17f0fe8
using tests from https://github.com/GoogleCloudDataproc/initializatio…
cjac f42a86d
reducing resources for build cluster ; pause for gcloud
cjac 811ad03
exercising spark-rapids from this template
cjac 4378edd
improved header documentation
cjac 992bd14
generated from templates in commit d5f7ffb7cf19852e48ce17c9ffae3640e7…
cjac 1d84952
replacing java spark tests with pyspark tests
cjac 88ccfec
pyspark test code
cjac 282ca0c
corrected function signature
cjac d6e9809
fixing order of operations for setting default cuda version ; removed…
cjac e3df6f2
including verify_pyspark.py in data list
cjac 89fe31b
verifying with gcloud dataproc jobs submit pyspark instead of spark ;…
cjac e221ede
re-enable ssh tests
cjac c9950a8
refactored ssh command retry code into the base class
cjac 8143d4c
remembered the imports ; sleep a random period
cjac 834f7d5
A100->H100
cjac 3d83795
fixing whitespace for python
cjac 7718e5a
moved knox variables to common env ; renamed ambiguous variable name
cjac aded30b
remove gpu related code from dask action
cjac f553371
changing failure to warning
cjac 0439a8d
removing more gpu stuff from dask
cjac 0251358
moved MASTER global variable to common/util_functions
cjac a643c9a
correct variable name
cjac d6867d9
moved hold_nvidia_packages out of common environment prepare into gpu…
cjac 6a7d10d
added comments and timing collection
cjac 1ab3f8d
no need to consider unsupported dataproc < 2.0 image versions ; reduc…
cjac 598b690
using "dask-scheduler" instead of "dask scheduler"
cjac 81c7d28
wait for dask scheduler before starting worker
cjac 48906e1
using variable instead of my own cluster master name
cjac 510e520
corrected syntax errors ; dump log on service failure
cjac 9e9f872
refactored some common code ; setting default value for metadata attr…
cjac 7480a23
added new function is_ramdisk ; keeping conda cache in its own direct…
cjac 6b73d22
calling functions from refactored pip setup/teardown
cjac 2e45a75
moved knox dask config to templates/dask/util_functions
cjac 33fdd38
added copyright to templates/legal/license_header
cjac 4f974c5
latest generated action
cjac 75d8e32
removed redundant template disclaimer
cjac 34fce25
setup and tear-down for actions which work with conda
cjac bbe062e
* refactored common conda installer functionality from dask.sh.in and
cjac 10f1698
tested rapids.sh init action with dataproc-repro
cjac 8a4cbd9
templates/dask/dask.sh.in,
cjac b01b867
refactor yarn functions into their own template
cjac c6c09db
refactor mig functions into their own template
cjac 88f9f7f
state before gpu rebranch
cjac 119f1b1
templates/common/util_functions:
cjac d45e16b
templates/dask/util_functions:
cjac a7b4707
refactored spark variable definition and reduced excess lines by bulk…
cjac 35ca704
development on these scripts will happen in the spark-rapids-template…
cjac 43232b2
revert dask/ to master
cjac 4b6e520
moving that .in suffix to the correct variable
cjac 4a024e0
reverted to master ; changes ended up in gpu-template-20250107
cjac f00e2f8
including libtemplate-perl as a dependency
cjac 7118ebf
moved to dask-template-20250104
cjac f2b50f7
moved to gpu-template-20250107
cjac 900c10a
* include version in template disclaimer
cjac bef08b1
migrated rapids.sh base template to rapids-template-20250106
cjac aa792c3
script to generate all actions from templates
cjac 824bcf8
spark prepare steps belong in common
cjac 374ff96
less noise in temp directory
cjac 5a37d94
tested with much older versions of CUDA on an old dataproc image from…
cjac 7662215
exercised older CUDA and mig a100 use case more ; added pytorch insta…
cjac 0c3eb51
create function to harden sshd config ; execute it before repairing o…
cjac 576bbb6
reviewed #1275 and brought closer to parity
cjac 07949a9
Merge branch 'GoogleCloudDataproc:master' into template-gpu-20241219
cjac 989b445
changes from testing PR #1275
cjac
@@ -49,7 +49,12 @@ initialize_git_repo() {
 determine_tests_to_run() {
   # Infer the files that changed
   mapfile -t DELETED_BUILD_FILES < <(git diff origin/master --name-only --diff-filter=D | grep BUILD)
-  mapfile -t CHANGED_FILES < <(git diff origin/master --name-only)
+  mapfile -t CHANGED_FILES < <(git diff origin/master --name-only | grep -v template)
+  for tt in $(git diff origin/master --name-only | grep 'templates/.*/.*\.sh\.in'); do
+    local genfile=`perl -e "print( q{${tt}} =~ m:templates/(.*?.sh).in: )"`
+    perl templates/generate-action.pl "${genfile}" > "${genfile}"
+    CHANGED_FILES+=("${genfile}")
+  done
   echo "Deleted BUILD files: ${DELETED_BUILD_FILES[*]}"
   echo "Changed files: ${CHANGED_FILES[*]}"

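The loop added in the hunk above maps each changed template to the action file it generates, using a Perl one-liner to strip the `templates/` prefix and the `.in` suffix. A minimal sketch of the same mapping with bash parameter expansion follows. The function name `template_to_action` is hypothetical and not part of the PR, the example path is illustrative, and the sketch assumes the path starts with `templates/`, whereas the Perl pattern is unanchored.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the PR): map a template path to the path of
# the action generated from it, e.g. templates/gpu/foo.sh.in -> gpu/foo.sh
template_to_action() {
  local -r template="$1"
  # Only paths shaped like templates/<dir>/<name>.sh.in are handled
  if [[ "${template}" != templates/*/*.sh.in ]]; then
    return 1
  fi
  local action="${template#templates/}"  # drop the leading directory
  action="${action%.in}"                 # drop the template suffix
  echo "${action}"
}

# Illustrative path; prints the generated action's path
template_to_action "templates/gpu/install_gpu_driver.sh.in"
```

Parameter expansion avoids spawning a Perl process per file, at the cost of being stricter about where `templates/` may appear in the path.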
@@ -70,6 +75,7 @@ determine_tests_to_run() {
     changed_dir="${changed_dir%%/*}/"
     # Run all tests if common directories modified
     if [[ ${changed_dir} =~ ^(integration_tests|util|cloudbuild)/$ ]]; then
+      continue
       echo "All tests will be run: '${changed_dir}' was changed"
       TESTS_TO_RUN=(":DataprocInitActionsTestSuite")
       return 0

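The check in the hunk above has two steps: reduce a path to its top-level directory, then test that directory against the list of common directories. With the added `continue`, the statements after it in that branch appear to be no longer reached, so a change under a common directory would not trigger the full suite. A small sketch of the two steps in isolation follows. The function names and example paths are hypothetical, and it assumes `changed_dir` starts out as a repository-relative path, which the hunk does not show.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not in the PR) illustrating the two steps of the check.

# Reduce a repository-relative path to its top-level directory, with a
# trailing slash, using the same expansion as the hunk.
top_level_dir() {
  local -r path="$1"
  echo "${path%%/*}/"
}

# Succeed when the directory is one of the common directories named in the hunk.
is_common_dir() {
  local -r dir="$1"
  [[ "${dir}" =~ ^(integration_tests|util|cloudbuild)/$ ]]
}

# Illustrative paths only
for f in "cloudbuild/presubmit.sh" "gpu/install_gpu_driver.sh"; do
  d="$(top_level_dir "${f}")"
  if is_common_dir "${d}"; then
    echo "${d}: common directory"
  else
    echo "${d}: action directory"
  fi
done
```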
@@ -104,7 +110,6 @@ run_tests() {
   bazel test \
     --jobs="${max_parallel_tests}" \
     --local_test_jobs="${max_parallel_tests}" \
-    --flaky_test_attempts=3 \
     --action_env="INTERNAL_IP_SSH=true" \
     --test_output="all" \
     --noshow_progress \

Review comment on the removed --flaky_test_attempts=3 line: The cost of flaky test attempts > 0 is that the tests often do not succeed on retry, so we just waste time provisioning and decommissioning clusters. Tests take longer to fail, and this extends development time.

@@ -0,0 +1,53 @@
+#
+# Generate repo file under /etc/apt/sources.list.d/
+#
+function apt_add_repo() {
+  local -r repo_name="$1"
+  local -r repo_data="$3" # "http(s)://host/path/uri argument0 .. argumentN"
+  local -r include_src="${4:-yes}"
+  local -r kr_path="${5:-/usr/share/keyrings/${repo_name}.gpg}"
+  local -r repo_path="${6:-/etc/apt/sources.list.d/${repo_name}.list}"
+
+  echo "deb [signed-by=${kr_path}] ${repo_data}" > "${repo_path}"
+  if [[ "${include_src}" == "yes" ]] ; then
+    echo "deb-src [signed-by=${kr_path}] ${repo_data}" >> "${repo_path}"
+  fi
+
+  apt-get update -qq
+}
+
+#
+# Generate repo file under /etc/yum.repos.d/
+#
+function dnf_add_repo() {
+  local -r repo_name="$1"
+  local -r repo_url="$3" # "http(s)://host/path/filename.repo"
+  local -r kr_path="${5:-/etc/pki/rpm-gpg/${repo_name}.gpg}"
+  local -r repo_path="${6:-/etc/yum.repos.d/${repo_name}.repo}"
+
+  curl -s -L "${repo_url}" \
+    | dd of="${repo_path}" status=progress
+#    | perl -p -e "s{^gpgkey=.*$}{gpgkey=file://${kr_path}}" \
+}
+
+#
+# Keyrings default to
+# /usr/share/keyrings/${repo_name}.gpg (debian/ubuntu) or
+# /etc/pki/rpm-gpg/${repo_name}.gpg (rocky/RHEL)
+#
+function os_add_repo() {
+  local -r repo_name="$1"
+  local -r signing_key_url="$2"
+  local -r repo_data="$3" # "http(s)://host/path/uri argument0 .. argumentN"
+  local kr_path
+  if is_debuntu ; then kr_path="${5:-/usr/share/keyrings/${repo_name}.gpg}"
+  else                 kr_path="${5:-/etc/pki/rpm-gpg/${repo_name}.gpg}" ; fi
+
+  mkdir -p "$(dirname "${kr_path}")"
+
+  curl -fsS --retry-connrefused --retry 10 --retry-max-time 30 "${signing_key_url}" \
+    | gpg --import --no-default-keyring --keyring "${kr_path}"
+
+  if is_debuntu ; then apt_add_repo "${repo_name}" "${signing_key_url}" "${repo_data}" "${4:-yes}" "${kr_path}" "${6:-}"
+  else                 dnf_add_repo "${repo_name}" "${signing_key_url}" "${repo_data}" "${4:-yes}" "${kr_path}" "${6:-}" ; fi
+}
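
To see what `apt_add_repo` writes without touching `/etc` or running `apt-get`, the sketch below prints the same `deb` and `deb-src` lines to standard output. The function name `render_apt_source_lines`, the repository name, and the URL are all hypothetical. Unlike `apt_add_repo`, which reads the repo data from `$3`, this helper takes it as `$2`.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper (not in the PR): print the lines that
# apt_add_repo would write to the sources list, instead of writing them.
render_apt_source_lines() {
  local -r repo_name="$1"
  local -r repo_data="$2"          # "http(s)://host/path/uri argument0 .. argumentN"
  local -r include_src="${3:-yes}"
  local -r kr_path="${4:-/usr/share/keyrings/${repo_name}.gpg}"

  echo "deb [signed-by=${kr_path}] ${repo_data}"
  if [[ "${include_src}" == "yes" ]] ; then
    echo "deb-src [signed-by=${kr_path}] ${repo_data}"
  fi
}

# Placeholder repository name and URL, for illustration only
render_apt_source_lines "example-repo" "https://repo.example.com/apt stable main"
```

Passing `no` as the third argument omits the `deb-src` line, mirroring the `include_src` parameter of `apt_add_repo`.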
@@ -0,0 +1,8 @@
+#
+# Google Cloud Dataproc Initialization Actions v[% IA_VERSION %]
+#
+# This initialization action is generated from
+# initialization-actions/templates/[% template_path %].in
+#
+# Modifications made directly to generated files will be lost when the
+# templates are next evaluated.
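
The `[% IA_VERSION %]` and `[% template_path %]` placeholders in the header above are filled in when the templates are evaluated. The commit list mentions `libtemplate-perl` as a dependency, so the PR presumably uses Perl's Template Toolkit for this. The sketch below only imitates the substitution with `sed` to show the rendered result. The function name `render_header` and the version and path values are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the template engine (not in the PR): replace the
# two placeholders used by the disclaimer header with concrete values.
render_header() {
  local -r version="$1"
  local -r template_path="$2"
  # "|" is used as the sed delimiter because template paths contain "/"
  sed -e "s|\[% IA_VERSION %]|${version}|g" \
      -e "s|\[% template_path %]|${template_path}|g"
}

# Illustrative values only
printf '%s\n' \
  '# Google Cloud Dataproc Initialization Actions v[% IA_VERSION %]' \
  '# initialization-actions/templates/[% template_path %].in' \
  | render_header "0.0.1" "gpu/install_gpu_driver.sh"
```

This handles only plain variable placeholders. Template Toolkit directives such as includes or conditionals are outside what a `sed` substitution can imitate.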
Review comment: this needs to come out before squash + merge.