[template] create templates for use in generating actions #1282

Draft
Wants to merge 131 commits into base: master. Changes shown are from 121 commits.

Commits
4f49f65
[template] generate gpu/install_gpu_driver.sh from templates
cjac Dec 20, 2024
1dae02b
new hold nvidia packages function ; moved variable definition around …
cjac Dec 20, 2024
e97e376
added two new gpu functions: configure_mig_cgi and enable_mig
cjac Dec 20, 2024
310bb9d
templatized version of mig.sh
cjac Dec 20, 2024
912ebe7
comment fix-up
cjac Dec 20, 2024
87965de
nvidia-container-toolkit repo setup changes are working on rocky8
cjac Dec 20, 2024
93fe4cc
defining variables in the generator script instead of duplicating in …
cjac Dec 21, 2024
b82aadc
tested with debian12
cjac Dec 21, 2024
dd98436
tested on 8x H100s with bookworm
cjac Dec 21, 2024
b4dabad
created and called function enable_and_configure_mig
cjac Dec 22, 2024
edeab28
moved comment to correct function
cjac Dec 23, 2024
0e8946c
do not point to local rpm pgp key
cjac Dec 24, 2024
c44195a
store completion signal files in their own directory
cjac Dec 25, 2024
31d1a9e
excessive sudo
cjac Dec 25, 2024
4a3a8cd
install spark rapids in all cases
cjac Dec 25, 2024
6ab36a5
merged spark-rapids functions into general gpu util_functions template
cjac Dec 25, 2024
ef36694
correcting variable name
cjac Dec 25, 2024
af69141
using new function name
cjac Dec 25, 2024
d59d5e6
driver version for 12.4.0 had not been tested in a while and had beco…
cjac Dec 26, 2024
8a9e00a
expanding non-default version tests ; adding utility function to veri…
cjac Dec 26, 2024
1113855
reduced boot disk size to 50GB
cjac Dec 26, 2024
7034739
skipping old cuda on new images ; sizing instances to build
cjac Dec 26, 2024
2873f49
skipping older debuntu when cuda version not specified
cjac Dec 26, 2024
576b32f
refactor into functions
cjac Dec 26, 2024
b03dc57
moved secure-boot utility functions and common environment setup into…
cjac Dec 26, 2024
bf98d85
refactored exit_handler
cjac Dec 26, 2024
4320953
declaring constants prior to running functions
cjac Dec 26, 2024
2b0947b
removed old variables, included a current one which does not get exer…
cjac Dec 27, 2024
5dbc1f2
do not break if variable undefined
cjac Dec 27, 2024
c5d46d3
order of operations error fixed with parentheses.
cjac Dec 27, 2024
7be62b3
using lower xgboost version for older dataproc images
cjac Dec 27, 2024
41c327a
test whether the variable is defined before testing its value
cjac Dec 27, 2024
b70477b
refactor the xgboost installer a little
cjac Dec 27, 2024
073ed1f
only minor changes
cjac Dec 27, 2024
4f66a51
explicitly notifying at the completion of the main function
cjac Dec 27, 2024
f8a9b7d
moved trap outside of the template
cjac Dec 27, 2024
19520b4
stop / start instead of restart
cjac Dec 27, 2024
f659ec5
skipping install on gpu-less systems more quickly
cjac Dec 27, 2024
af817f0
install_dependencies is called from base template prep function
cjac Dec 27, 2024
2e7441b
re-thought about the dependencies install time
cjac Dec 27, 2024
29631a0
refactored configure_gpu_exclusive_mode to fewer lines
cjac Dec 27, 2024
7ea7653
refactored gpu-related code out of common function library ; less rea…
cjac Dec 27, 2024
70349a6
being more surgical about signing material usage
cjac Dec 27, 2024
b5473c5
removed dependency on pciutils ; defined is_debuntu with other os com…
cjac Dec 27, 2024
be3dbf6
again I meant elif
cjac Dec 27, 2024
3ca8c91
fall back on metadata value if modulus_md5sum variable undefined
cjac Dec 28, 2024
83d5ccc
switch to other build_dir variable assignment
cjac Dec 28, 2024
93fdb30
parens
cjac Dec 28, 2024
917f4b6
allow failure when grepping PCI devices for 10DE
cjac Dec 28, 2024
668db72
removed listing of nodes_include ; does not work in custom-images con…
cjac Dec 28, 2024
2193c28
min spark version supported by newer rapids is insufficient ; xgboost…
cjac Dec 28, 2024
992d83a
skipping fewer tests
cjac Dec 28, 2024
7170872
simplified rapids / xgboost default version logic
cjac Dec 28, 2024
b33cb27
ubuntu sometimes takes a while to bring gcloud online
cjac Dec 28, 2024
c56440a
only using 24.08.1 on 2.2 images ; fix a typo in a comment
cjac Dec 28, 2024
f10df49
refactored ; these files should be quite similar now
cjac Dec 29, 2024
f3a103e
returning spark-rapids/* to master ; this version of these templates …
cjac Jan 2, 2025
0ac57a0
return test suite to master
cjac Jan 2, 2025
b3e5618
do not run all tests ; also do not retry failures
cjac Jan 2, 2025
b4e99ee
expanding non-default version tests ; adding utility function to veri…
cjac Dec 26, 2024
e4eab7b
reverting to master
cjac Jan 2, 2025
95b17ac
reverting test_spark-rapids.py to master
cjac Jan 2, 2025
212b9af
do not consider templates as changed files
cjac Jan 2, 2025
e9b9e5d
using nvsmi for some error protection
cjac Jan 2, 2025
adf4312
corrected comments
cjac Jan 3, 2025
dfcd8b0
defining xpath variables as local
cjac Jan 3, 2025
9bb4d66
tested on 2.1-ubuntu20
cjac Jan 3, 2025
17f0fe8
using tests from https://github.com/GoogleCloudDataproc/initializatio…
cjac Jan 3, 2025
f42a86d
reducing resources for build cluster ; pause for gcloud
cjac Jan 3, 2025
811ad03
exercising spark-rapids from this template
cjac Jan 4, 2025
4378edd
improved header documentation
cjac Jan 4, 2025
992bd14
generated from templates in commit d5f7ffb7cf19852e48ce17c9ffae3640e7…
cjac Jan 4, 2025
1d84952
replacing java spark tests with pyspark tests
cjac Jan 4, 2025
88ccfec
pyspark test code
cjac Jan 4, 2025
282ca0c
corrected function signature
cjac Jan 4, 2025
d6e9809
fixing order of operations for setting default cuda version ; removed…
cjac Jan 4, 2025
e3df6f2
including verify_pyspark.py in data list
cjac Jan 4, 2025
89fe31b
verifying with gcloud dataproc jobs submit pyspark instead of spark ;…
cjac Jan 4, 2025
e221ede
re-enable ssh tests
cjac Jan 4, 2025
c9950a8
refactored ssh command retry code into the base class
cjac Jan 4, 2025
8143d4c
remembered the imports ; sleep a random period
cjac Jan 4, 2025
834f7d5
A100->H100
cjac Jan 4, 2025
3d83795
fixing whitespace for python
cjac Jan 4, 2025
7718e5a
moved knox variables to common env ; renamed ambiguous variable name
cjac Jan 5, 2025
aded30b
remove gpu related code from dask action
cjac Jan 5, 2025
f553371
changing failure to warning
cjac Jan 5, 2025
0439a8d
removing more gpu stuff from dask
cjac Jan 5, 2025
0251358
moved MASTER global variable to common/util_functions
cjac Jan 5, 2025
a643c9a
correct variable name
cjac Jan 5, 2025
d6867d9
moved hold_nvidia_packages out of common environment prepare into gpu…
cjac Jan 5, 2025
6a7d10d
added comments and timing collection
cjac Jan 6, 2025
1ab3f8d
no need to consider unsupported dataproc < 2.0 image versions ; reduc…
cjac Jan 6, 2025
598b690
using "dask-scheduler" instead of "dask scheduler"
cjac Jan 6, 2025
81c7d28
wait for dask scheduler before starting worker
cjac Jan 6, 2025
48906e1
using variable instead of my own cluster master name
cjac Jan 6, 2025
510e520
corrected syntax errors ; dump log on service failure
cjac Jan 6, 2025
9e9f872
refactored some common code ; setting default value for metadata attr…
cjac Jan 6, 2025
7480a23
added new function is_ramdisk ; keeping conda cache in its own direct…
cjac Jan 6, 2025
6b73d22
calling functions from refactored pip setup/teardown
cjac Jan 6, 2025
2e45a75
moved knox dask config to templates/dask/util_functions
cjac Jan 6, 2025
33fdd38
added copyright to templates/legal/license_header
cjac Jan 6, 2025
4f974c5
latest generated action
cjac Jan 6, 2025
75d8e32
removed redundant template disclaimer
cjac Jan 6, 2025
34fce25
setup and tear-down for actions which work with conda
cjac Jan 6, 2025
bbe062e
* refactored common conda installer functionality from dask.sh.in and
cjac Jan 6, 2025
10f1698
tested rapids.sh init action with dataproc-repro
cjac Jan 6, 2025
8a4cbd9
templates/dask/dask.sh.in,
cjac Jan 6, 2025
b01b867
refactor yarn functions into their own template
cjac Jan 7, 2025
c6c09db
refactor mig functions into their own template
cjac Jan 7, 2025
88f9f7f
state before gpu rebranch
cjac Jan 7, 2025
119f1b1
templates/common/util_functions:
cjac Jan 8, 2025
d45e16b
templates/dask/util_functions:
cjac Jan 9, 2025
a7b4707
refactored spark variable definition and reduced excess lines by bulk…
cjac Jan 9, 2025
35ca704
development on these scripts will happen in the spark-rapids-template…
cjac Jan 9, 2025
43232b2
revert dask/ to master
cjac Jan 9, 2025
4b6e520
moving that .in suffix to the correct variable
cjac Jan 8, 2025
4a024e0
reverted to master ; changes ended up in gpu-template-20250107
cjac Jan 9, 2025
f00e2f8
including libtemplate-perl as a dependency
cjac Jan 9, 2025
7118ebf
moved to dask-template-20250104
cjac Jan 9, 2025
f2b50f7
moved to gpu-template-20250107
cjac Jan 9, 2025
900c10a
* include version in template disclaimer
cjac Jan 9, 2025
bef08b1
migrated rapids.sh base template to rapids-template-20250106
cjac Jan 10, 2025
aa792c3
script to generate all actions from templates
cjac Jan 10, 2025
824bcf8
spark prepare steps belong in common
cjac Jan 10, 2025
374ff96
less noise in temp directory
cjac Jan 12, 2025
5a37d94
tested with much older versions of CUDA on an old dataproc image from…
cjac Jan 16, 2025
7662215
exercised older CUDA and mig a100 use case more ; added pytorch insta…
cjac Jan 16, 2025
0c3eb51
create function to harden sshd config ; execute it before repairing o…
cjac Jan 19, 2025
576bbb6
reviewed #1275 and brought closer to parity
cjac Jan 23, 2025
07949a9
Merge branch 'GoogleCloudDataproc:master' into template-gpu-20241219
cjac Jan 27, 2025
989b445
changes from testing PR #1275
cjac Jan 29, 2025
3 changes: 2 additions & 1 deletion cloudbuild/Dockerfile
@@ -22,7 +22,8 @@ RUN /usr/bin/curl -s https://bazel.build/bazel-release.pub.gpg | \
     dd of="${bazel_repo_file}" status=none && \
     apt-get update -qq
 RUN apt-get autoremove -y -qq > /dev/null 2>&1 && \
-    apt-get install -y -qq default-jdk python3-setuptools bazel-${bazel_version} > /dev/null 2>&1 && \
+    apt-get install -y -qq default-jdk python3-setuptools bazel-${bazel_version} \
+      libtemplate-perl > /dev/null 2>&1 && \
     apt-get clean
 
 # Set bazel-${bazel_version} as the default bazel alternative in this container
9 changes: 7 additions & 2 deletions cloudbuild/presubmit.sh
@@ -49,7 +49,12 @@ initialize_git_repo() {
 determine_tests_to_run() {
   # Infer the files that changed
   mapfile -t DELETED_BUILD_FILES < <(git diff origin/master --name-only --diff-filter=D | grep BUILD)
-  mapfile -t CHANGED_FILES < <(git diff origin/master --name-only)
+  mapfile -t CHANGED_FILES < <(git diff origin/master --name-only | grep -v template)
+  for tt in $(git diff origin/master --name-only | grep 'templates/.*/.*\.sh\.in'); do
+    local genfile=`perl -e "print( q{${tt}} =~ m:templates/(.*?.sh).in: )"`
+    perl templates/generate-action.pl "${genfile}" > "${genfile}"
+    CHANGED_FILES+=("${genfile}")
+  done
   echo "Deleted BUILD files: ${DELETED_BUILD_FILES[*]}"
   echo "Changed files: ${CHANGED_FILES[*]}"
 
@@ -70,6 +75,7 @@ determine_tests_to_run() {
     changed_dir="${changed_dir%%/*}/"
     # Run all tests if common directories modified
     if [[ ${changed_dir} =~ ^(integration_tests|util|cloudbuild)/$ ]]; then
+      continue

Review comment (Contributor, Author):
this needs to come out before squash + merge.

       echo "All tests will be run: '${changed_dir}' was changed"
       TESTS_TO_RUN=(":DataprocInitActionsTestSuite")
       return 0
@@ -104,7 +110,6 @@ run_tests() {
   bazel test \
     --jobs="${max_parallel_tests}" \
     --local_test_jobs="${max_parallel_tests}" \
-    --flaky_test_attempts=3 \

Review comment (Contributor, Author):
The cost of flaky test attempts > 0 is that many times the tests do not succeed on retries, and instead we just wasted the provisioning and decommissioning of clusters. Tests take longer to fail, and this extends development time.

     --action_env="INTERNAL_IP_SSH=true" \
     --test_output="all" \
     --noshow_progress \
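The template-to-action path mapping done by the Perl one-liner in `determine_tests_to_run` can also be expressed with plain bash parameter expansion. What follows is a minimal sketch, not code from this PR: `template_to_genfile` is a hypothetical helper name, the example template path is assumed from the `templates/.*/.*\.sh\.in` pattern, and the helper assumes every template lives under `templates/` and ends in `.sh.in`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): map a template path such as
# templates/gpu/install_gpu_driver.sh.in to gpu/install_gpu_driver.sh.
template_to_genfile() {
  local path="$1"
  # Reject anything that is not templates/<dir>/<name>.sh.in
  [[ "${path}" == templates/*/*.sh.in ]] || return 1
  path="${path#templates/}"    # strip the leading templates/ directory
  printf '%s\n' "${path%.in}"  # strip the trailing .in suffix
}

# Example path assumed for illustration
template_to_genfile "templates/gpu/install_gpu_driver.sh.in"
```

One possible reason to prefer this form: glob patterns treat `.` literally, whereas the unescaped `.` in `m:templates/(.*?.sh).in:` matches any character.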
24 changes: 20 additions & 4 deletions integration_tests/dataproc_test_case.py
@@ -7,6 +7,8 @@
 import string
 import subprocess
 import sys
+import time
+import random
 from threading import Timer
 
 import pkg_resources
@@ -287,10 +289,24 @@ def assert_instance_command(self,
             AssertionError: if command returned non-0 exit code.
         """
 
-        ret_code, stdout, stderr = self.assert_command(
-            'gcloud compute ssh {} --zone={} --command="{}"'.format(
-                instance, self.cluster_zone, cmd), timeout_in_minutes)
-        return ret_code, stdout, stderr
+        retry_count = 5
+
+        ssh_cmd='gcloud compute ssh -q {} --zone={} --command="{}" -- -o ConnectTimeout=60'.format(
+            instance, self.cluster_zone, cmd)
+
+        while retry_count > 0:
+            try:
+                ret_code, stdout, stderr = self.assert_command(
+                    ssh_cmd, timeout_in_minutes )
+                return ret_code, stdout, stderr
+            except Exception as e:
+                print("An error occurred: ", e)
+                retry_count -= 1
+                if retry_count > 0:
+                    time.sleep( 3 + random.randint(1, 10) )
+                    continue
+                else:
+                    raise
 
     def assert_dataproc_job(self,
                             cluster_name,
53 changes: 53 additions & 0 deletions templates/common/install_functions
@@ -0,0 +1,53 @@
+#
+# Generate repo file under /etc/apt/sources.list.d/
+#
+function apt_add_repo() {
+  local -r repo_name="$1"
+  local -r repo_data="$3" # "http(s)://host/path/uri argument0 .. argumentN"
+  local -r include_src="${4:-yes}"
+  local -r kr_path="${5:-/usr/share/keyrings/${repo_name}.gpg}"
+  local -r repo_path="${6:-/etc/apt/sources.list.d/${repo_name}.list}"
+
+  echo "deb [signed-by=${kr_path}] ${repo_data}" > "${repo_path}"
+  if [[ "${include_src}" == "yes" ]] ; then
+    echo "deb-src [signed-by=${kr_path}] ${repo_data}" >> "${repo_path}"
+  fi
+
+  apt-get update -qq
+}
+
+#
+# Generate repo file under /etc/yum.repos.d/
+#
+function dnf_add_repo() {
+  local -r repo_name="$1"
+  local -r repo_url="$3" # "http(s)://host/path/filename.repo"
+  local -r kr_path="${5:-/etc/pki/rpm-gpg/${repo_name}.gpg}"
+  local -r repo_path="${6:-/etc/yum.repos.d/${repo_name}.repo}"
+
+  curl -s -L "${repo_url}" \
+    | dd of="${repo_path}" status=progress
+#    | perl -p -e "s{^gpgkey=.*$}{gpgkey=file://${kr_path}}" \
+}
+
+#
+# Keyrings default to
+# /usr/share/keyrings/${repo_name}.gpg (debian/ubuntu) or
+# /etc/pki/rpm-gpg/${repo_name}.gpg    (rocky/RHEL)
+#
+function os_add_repo() {
+  local -r repo_name="$1"
+  local -r signing_key_url="$2"
+  local -r repo_data="$3" # "http(s)://host/path/uri argument0 .. argumentN"
+  local kr_path
+  if is_debuntu ; then kr_path="${5:-/usr/share/keyrings/${repo_name}.gpg}"
+  else kr_path="${5:-/etc/pki/rpm-gpg/${repo_name}.gpg}" ; fi
+
+  mkdir -p "$(dirname "${kr_path}")"
+
+  curl -fsS --retry-connrefused --retry 10 --retry-max-time 30 "${signing_key_url}" \
+    | gpg --import --no-default-keyring --keyring "${kr_path}"
+
+  if is_debuntu ; then apt_add_repo "${repo_name}" "${signing_key_url}" "${repo_data}" "${4:-yes}" "${kr_path}" "${6:-}"
+  else dnf_add_repo "${repo_name}" "${signing_key_url}" "${repo_data}" "${4:-yes}" "${kr_path}" "${6:-}" ; fi
+}
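The entries that `apt_add_repo` writes can be previewed without touching `/etc/apt` or running `apt-get update`. This is a sketch under stated assumptions rather than part of the PR: `render_apt_sources` is a hypothetical name, and the repository name, URL, and suite below are placeholder values.

```shell
#!/usr/bin/env bash
# Hypothetical preview helper (not part of the PR): print the
# sources.list.d entries apt_add_repo would write, without side effects.
render_apt_sources() {
  local -r repo_name="$1"
  local -r repo_data="$2"   # "http(s)://host/path/uri argument0 .. argumentN"
  local -r include_src="${3:-yes}"
  local -r kr_path="${4:-/usr/share/keyrings/${repo_name}.gpg}"

  echo "deb [signed-by=${kr_path}] ${repo_data}"
  if [[ "${include_src}" == "yes" ]] ; then
    echo "deb-src [signed-by=${kr_path}] ${repo_data}"
  fi
}

# Placeholder repository data, for illustration only
render_apt_sources "example" "https://repo.example.com/apt stable main" "no"
```

Note that the positions differ from the template: `apt_add_repo` and `dnf_add_repo` read the repository data from `$3` and skip `$2`, because `os_add_repo` forwards the signing key URL in that slot.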
8 changes: 8 additions & 0 deletions templates/common/template_disclaimer
@@ -0,0 +1,8 @@
+#
+# Google Cloud Dataproc Initialization Actions v[% IA_VERSION %]
+#
+# This initialization action is generated from
+# initialization-actions/templates/[% template_path %].in
+#
+# Modifications made directly to generated files will be lost when the
+# templates are next evaluated.
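To illustrate what evaluating the `[% IA_VERSION %]` and `[% template_path %]` placeholders amounts to, here is a deliberately simplified stand-in. The PR generates actions with `perl templates/generate-action.pl`; the `sed` filter below is not that generator, `render_placeholders` is a hypothetical name, and the version string is a made-up example value.

```shell
#!/usr/bin/env bash
# Stand-in for illustration only: the PR renders templates with a Perl
# generator; this sed filter merely shows the substitution idea.
render_placeholders() {
  local -r ia_version="$1"
  local -r template_path="$2"
  # Values must not contain '|', '&' or '\' because they are pasted into sed
  sed -e "s|\[% IA_VERSION %]|${ia_version}|g" \
      -e "s|\[% template_path %]|${template_path}|g"
}

# Example values assumed for illustration
printf '%s\n' '# generated from templates/[% template_path %].in, v[% IA_VERSION %]' \
  | render_placeholders "0.0.0-example" "gpu/install_gpu_driver.sh"
```

A real template engine also handles conditionals, includes, and escaping, which is why this only makes sense as a reading aid for the disclaimer text.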