
[template] create templates for use in generating actions #1282

Draft: cjac wants to merge 131 commits into master from template-gpu-20241219

Conversation

@cjac (Contributor) commented Dec 20, 2024:

This PR should resolve #1276 and attempts a more complete solution to the problem space of #1030.

I believe that #1259 could be implemented more easily using this change, but its dependency on rebooting is antithetical to Dataproc in many ways, so it has not been included. I will meet with NVIDIA and the Dataproc engineering team to troubleshoot the problem.

This PR includes code refactored out of the GPU-acceleration-related and Dask-related actions into files under the templates/ directory of the repository. There is a set of PRs that rebase onto this branch:
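To make the shape of the change concrete: each published action is rendered from a template that pulls shared fragments in once, instead of carrying hand-copied duplicates of them. Below is a minimal sketch of what such a template might look like, in Template Toolkit syntax; the file name, the processed fragment, and the helper functions are hypothetical illustrations, not the actual contents of this PR.

    #!/bin/bash
    # templates/example/example.sh.in -- hypothetical template fragment.
    # The generator expands the [% ... %] Template Toolkit directives and
    # writes the result out as example/example.sh.
    [% PROCESS common/util_functions %]

    function main() {
      # Shared helpers (OS checks, retries, caching) come from the
      # processed common fragment instead of being pasted per action.
      prepare_common_env   # hypothetical helper from the shared fragment
      install_component    # hypothetical per-action work
    }

    main "$@"

Rendering every action from shared fragments like this is what keeps the copies from drifting out of sync, which is the failure mode described in #1276.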

@cjac (Contributor, Author) commented:

Hello @davorg - As I was reading through the literature, bringing myself back up to speed on the state of the art in template toolkits, I noticed a book with one of my friends' names on it, one I had glanced at many times over the last couple of decades. I did not realize until just a few days ago that the dlc who was in charge of desk allocation in my cube farm when I started was the same dlc who wrote the book on this particular subject.

Anyway, I've been thinking of you and our peers as I've been hacking away at this installer. If you felt like looking things over and picking some nits, I'd love to hear your feedback. I hope your holidays are merry and all that!

@cjac (Contributor, Author) commented:

@shlomif oh, hey, I see that you are actively participating in Template.pm development. I'm not doing a lot with it in this repository; everything is pretty straightforward, I think. If you had some spare time to take a peek at the new templates/ directory in this repo, and especially templates/generate-action.pl, it might be fun to chat about it. I hope your holidays went well!
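For anyone who wants to experiment locally, regenerating an action from its template should be a one-liner along the lines below; the exact arguments are a guess at the interface, so consult templates/generate-action.pl itself for the real invocation.

    # Hypothetical invocation -- check templates/generate-action.pl for the actual interface.
    perl templates/generate-action.pl spark-rapids/spark-rapids.sh.in > spark-rapids/spark-rapids.sh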


@cjac: hi! Where can I find the templates directory? Please give a URL.

@cjac (Contributor, Author) replied Jan 8, 2025.

@cjac (Contributor, Author) commented Jan 2, 2025:

/gcbrun

1 similar comment
@cjac (Contributor, Author) commented Jan 2, 2025:

/gcbrun

cjac force-pushed the template-gpu-20241219 branch from 8d28938 to e511a6e on January 2, 2025 20:23
@cjac (Contributor, Author) commented Jan 2, 2025:

/gcbrun

1 similar comment
@cjac (Contributor, Author) commented Jan 2, 2025:

/gcbrun

@cjac (Contributor, Author) left a review:

Added some comments to address issues with the documentation.

Quoted from templates/spark-rapids/mig.sh.in:

# --metadata=ENABLE_MIG can be used to enable or disable MIG. The default is to enable it.
# The script does a reboot to fully enable MIG and then configures the MIG device based on the
# user specified MIG_CGI profiles specified via: --metadata=^:^MIG_CGI='9,9'. If MIG_CGI
# is not specified it assumes it's using an A100 and configures 2 instances with profile id 9.
@cjac (Contributor, Author):
s/A100/H100/

#
# This script should be specified in --metadata=startup-script-url= option and
# --metadata=ENABLE_MIG can be used to enable or disable MIG. The default is to enable it.
# The script does a reboot to fully enable MIG and then configures the MIG device based on the
@cjac (Contributor, Author):
The script does not ever reboot, and neither should you.
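Putting the corrected documentation together, supplying the MIG configuration at cluster creation might look like the sketch below. The cluster name, bucket, and accelerator shape are placeholders; the ^:^ prefix is gcloud's delimiter-override syntax, which lets the comma inside MIG_CGI survive metadata parsing, and the script is delivered via startup-script-url as the header above describes.

    gcloud dataproc clusters create example-mig-cluster \
      --region=us-central1 \
      --worker-accelerator=type=nvidia-h100-80gb,count=1 \
      --metadata=^:^ENABLE_MIG=true:MIG_CGI=9,9:startup-script-url=gs://example-bucket/spark-rapids/mig.sh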

templates/spark-rapids/mig.sh.in (outdated; resolved)
@cjac (Contributor, Author) commented Jan 2, 2025:

/gcbrun

2 similar comments
@cjac (Contributor, Author) commented Jan 3, 2025:

/gcbrun

@cjac (Contributor, Author) commented Jan 3, 2025:

/gcbrun

@cjac (Contributor, Author) commented Jan 3, 2025:

Using the test suite I just cleaned up for #1275.

@cjac (Contributor, Author) commented Jan 3, 2025:

/gcbrun

@cjac (Contributor, Author) commented Jan 3, 2025:

2.1-debian11 failure:

AssertionError: 1 != 0 : Failed to execute command:
gcloud dataproc jobs submit spark --cluster=test-gpu-standard-2-1-20250103-030909-kdee --region=us-central1 --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar --class=org.apache.spark.examples.ml.JavaIndexToStringExample --properties=spark.executor.resource.gpu.amount=1,spark.executor.cores=6,spark.executor.memory=4G,spark.task.resource.gpu.amount=0.333,spark.task.cpus=2,spark.yarn.unmanagedAM.enabled=false
STDOUT:

STDERR:
Job [474683bad64a45e8af6cc00ccc9695ae] submitted.
Waiting for job output...
25/01/03 03:14:42 INFO SparkEnv: Registering MapOutputTracker
25/01/03 03:14:42 INFO SparkEnv: Registering BlockManagerMaster
25/01/03 03:14:42 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
25/01/03 03:14:42 INFO SparkEnv: Registering OutputCommitCoordinator
25/01/03 03:14:43 INFO DataprocSparkPlugin: Registered 128 driver metrics
25/01/03 03:14:43 INFO ShimLoader: Loading shim for Spark version: 3.3.2
25/01/03 03:14:43 INFO ShimLoader: Complete Spark build info: 3.3.2, https://bigdataoss-internal.googlesource.com/third_party/apache/spark, dataproc-branch-3.3.2, 5672c094ffe3ff9aa967db7b81163e1cc586a093, 2024-10-23T22:06:45Z
25/01/03 03:14:43 INFO ShimLoader: findURLClassLoader found a URLClassLoader org.apache.spark.util.MutableURLClassLoader@61ab89b0
25/01/03 03:14:43 INFO ShimLoader: Updating spark classloader org.apache.spark.util.MutableURLClassLoader@61ab89b0 with the URLs: jar:file:/usr/lib/spark/jars/rapids-4-spark_2.12-23.08.2.jar!/spark3xx-common/, jar:file:/usr/lib/spark/jars/rapids-4-spark_2.12-23.08.2.jar!/spark332/
25/01/03 03:14:43 INFO ShimLoader: Spark classLoader org.apache.spark.util.MutableURLClassLoader@61ab89b0 updated successfully
25/01/03 03:14:43 INFO ShimLoader: Updating spark classloader org.apache.spark.util.MutableURLClassLoader@61ab89b0 with the URLs: jar:file:/usr/lib/spark/jars/rapids-4-spark_2.12-23.08.2.jar!/spark3xx-common/, jar:file:/usr/lib/spark/jars/rapids-4-spark_2.12-23.08.2.jar!/spark332/
25/01/03 03:14:43 INFO ShimLoader: Spark classLoader org.apache.spark.util.MutableURLClassLoader@61ab89b0 updated successfully
25/01/03 03:14:43 INFO RapidsPluginUtils: RAPIDS Accelerator build: {date=2023-10-05T09:57:39Z, cudf_version=23.08.0, version=23.08.2, user=, branch=HEAD, url=https://github.com/NVIDIA/spark-rapids.git, revision=56da18a1be0148025cb00ced2ffe039fbf9c3391}
25/01/03 03:14:43 INFO RapidsPluginUtils: RAPIDS Accelerator JNI build: {date=2023-08-10T03:31:37Z, version=23.08.0, user=, branch=HEAD, url=https://github.com/NVIDIA/spark-rapids-jni.git, revision=73fcd5ce22a622e5937a613bc5c4a1b32a40aec1}
25/01/03 03:14:43 INFO RapidsPluginUtils: cudf build: {date=2023-08-10T03:31:37Z, version=23.08.0, user=, branch=HEAD, url=https://github.com/rapidsai/cudf.git, revision=8150d38e080c8fb021921ade83fe3aa3be04b47d}
25/01/03 03:14:43 WARN RapidsPluginUtils: RAPIDS Accelerator 23.08.2 using cudf 23.08.0.
25/01/03 03:14:43 WARN RapidsPluginUtils: spark.rapids.sql.multiThreadedRead.numThreads is set to 20.
25/01/03 03:14:43 WARN RapidsPluginUtils: The current setting of spark.task.resource.gpu.amount (0.333) is not ideal to get the best performance from the RAPIDS Accelerator plugin. It's recommended to be 1/{executor core count} unless you have a special use case.
25/01/03 03:14:43 WARN RapidsPluginUtils: RAPIDS Accelerator is enabled, to disable GPU support set `spark.rapids.sql.enabled` to false.
25/01/03 03:14:43 WARN RapidsPluginUtils: spark.rapids.sql.explain is set to `NOT_ON_GPU`. Set it to 'NONE' to suppress the diagnostics logging about the query placement on the GPU.
25/01/03 03:14:44 INFO DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at test-gpu-standard-2-1-20250103-030909-kdee-m.us-central1-f.c.cloud-dataproc-ci.internal./10.128.0.50:8032
25/01/03 03:14:44 INFO AHSProxy: Connecting to Application History server at test-gpu-standard-2-1-20250103-030909-kdee-m.us-central1-f.c.cloud-dataproc-ci.internal./10.128.0.50:10200
25/01/03 03:14:44 INFO Configuration: found resource resource-types.xml at file:/etc/hadoop/conf.empty/resource-types.xml
25/01/03 03:14:44 INFO ResourceUtils: Adding resource type - name = yarn.io/gpu, units = , type = COUNTABLE
25/01/03 03:14:46 INFO YarnClientImpl: Submitted application application_1735873832026_0001
25/01/03 03:14:56 INFO GoogleCloudStorageImpl: Ignoring exception of type GoogleJsonResponseException; verified object already exists with desired state.
25/01/03 03:15:00 WARN GpuOverrides:
!Exec <ObjectHashAggregateExec> cannot run on GPU because not all expressions can be replaced
  @Expression <AggregateExpression> stringindexeraggregator(org.apache.spark.ml.feature.StringIndexerAggregator@f243d5c, Some(createexternalrow(category#1.toString, StructField(category,StringType,false))), Some(interface org.apache.spark.sql.Row), Some(StructType(StructField(category,StringType,false))), encodeusingserializer(input[0, java.lang.Object, true], true), decodeusingserializer(input[0, binary, true], Array[org.apache.spark.util.collection.OpenHashMap], true), encodeusingserializer(input[0, java.lang.Object, true], true), BinaryType, true, 0, 0) could run on GPU
    ! <ComplexTypedAggregateExpression> StringIndexerAggregator(org.apache.spark.sql.Row) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression
      ! <CreateExternalRow> createexternalrow(category#1.toString, StructField(category,StringType,false)) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.CreateExternalRow
        ! <Invoke> category#1.toString cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.Invoke
          @Expression <AttributeReference> category#1 could run on GPU
      ! <EncodeUsingSerializer> encodeusingserializer(input[0, java.lang.Object, true], true) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.EncodeUsingSerializer
        ! <BoundReference> input[0, java.lang.Object, true] cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.BoundReference
      ! <DecodeUsingSerializer> decodeusingserializer(input[0, binary, true], Array[org.apache.spark.util.collection.OpenHashMap], true) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.DecodeUsingSerializer
        ! <BoundReference> input[0, binary, true] cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.BoundReference
      ! <EncodeUsingSerializer> encodeusingserializer(input[0, java.lang.Object, true], true) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.EncodeUsingSerializer
        ! <BoundReference> input[0, java.lang.Object, true] cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.BoundReference
  @Expression <AttributeReference> StringIndexerAggregator(org.apache.spark.sql.Row)#14 could run on GPU
  @Expression <Alias> StringIndexerAggregator(org.apache.spark.sql.Row)#14 AS StringIndexerAggregator(org.apache.spark.sql.Row)#15 could run on GPU
    @Expression <AttributeReference> StringIndexerAggregator(org.apache.spark.sql.Row)#14 could run on GPU
  !Exec <ShuffleExchangeExec> cannot run on GPU because Columnar exchange without columnar children is inefficient
    @Partitioning <SinglePartition$> could run on GPU
    !Exec <ObjectHashAggregateExec> cannot run on GPU because not all expressions can be replaced
      @Expression <AggregateExpression> partial_stringindexeraggregator(org.apache.spark.ml.feature.StringIndexerAggregator@f243d5c, Some(createexternalrow(category#1.toString, StructField(category,StringType,false))), Some(interface org.apache.spark.sql.Row), Some(StructType(StructField(category,StringType,false))), encodeusingserializer(input[0, java.lang.Object, true], true), decodeusingserializer(input[0, binary, true], Array[org.apache.spark.util.collection.OpenHashMap], true), encodeusingserializer(input[0, java.lang.Object, true], true), BinaryType, true, 0, 0) could run on GPU
        ! <ComplexTypedAggregateExpression> StringIndexerAggregator(org.apache.spark.sql.Row) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression
          ! <CreateExternalRow> createexternalrow(category#1.toString, StructField(category,StringType,false)) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.CreateExternalRow
            ! <Invoke> category#1.toString cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.Invoke
              @Expression <AttributeReference> category#1 could run on GPU
          ! <EncodeUsingSerializer> encodeusingserializer(input[0, java.lang.Object, true], true) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.EncodeUsingSerializer
            ! <BoundReference> input[0, java.lang.Object, true] cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.BoundReference
          ! <DecodeUsingSerializer> decodeusingserializer(input[0, binary, true], Array[org.apache.spark.util.collection.OpenHashMap], true) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.DecodeUsingSerializer
            ! <BoundReference> input[0, binary, true] cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.BoundReference
          ! <EncodeUsingSerializer> encodeusingserializer(input[0, java.lang.Object, true], true) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.EncodeUsingSerializer
            ! <BoundReference> input[0, java.lang.Object, true] cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.BoundReference
      @Expression <AttributeReference> buf#19 could run on GPU
      @Expression <AttributeReference> buf#20 could run on GPU
      ! <LocalTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec
        @Expression <AttributeReference> category#1 could run on GPU

25/01/03 03:15:00 INFO GpuOverrides: Plan conversion to the GPU took 82.60 ms
25/01/03 03:15:00 WARN GpuOverrides:
[same GpuOverrides plan dump as above, repeated verbatim]
25/01/03 03:15:00 INFO GpuOverrides: Plan conversion to the GPU took 7.28 ms
25/01/03 03:15:01 INFO GpuOverrides: Plan conversion to the GPU took 1.26 ms
25/01/03 03:15:01 WARN GpuOverrides:
[same GpuOverrides plan dump as above, repeated verbatim]
25/01/03 03:15:01 INFO GpuOverrides: Plan conversion to the GPU took 4.75 ms
25/01/03 03:15:01 INFO GpuOverrides: GPU plan transition optimization took 13.66 ms
25/01/03 03:15:01 WARN GpuOverrides:
[the ShuffleExchangeExec subtree of the same plan dump, repeated]
25/01/03 03:15:01 INFO GpuOverrides: Plan conversion to the GPU took 4.15 ms
25/01/03 03:15:01 INFO GpuOverrides: GPU plan transition optimization took 7.25 ms
25/01/03 03:15:43 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 2 for reason Container from a bad node: container_1735873832026_0001_01_000003 on host: test-gpu-standard-2-1-20250103-030909-kdee-w-0.us-central1-f.c.cloud-dataproc-ci.internal. Exit status: 1. Diagnostics: [2025-01-03 03:15:43.085]Exception from container-launch.
Container id: container_1735873832026_0001_01_000003
Exit code: 1
Exception message: Launch container failed
Shell error output: Nonzero exit code=1, error message='Invalid argument number'

@cjac (Contributor, Author) commented Jan 4, 2025:

/gcbrun

3 similar comments
@cjac (Contributor, Author) commented Jan 4, 2025:

/gcbrun

@cjac (Contributor, Author) commented Jan 4, 2025:

/gcbrun

@cjac (Contributor, Author) commented Jan 4, 2025:

/gcbrun

@cjac (Contributor, Author) commented Jan 4, 2025:

Well, that's good news, then.

@cjac (Contributor, Author) commented Jan 4, 2025:

/gcbrun

cjac marked this pull request as ready for review on January 4, 2025 07:59
@cjac (Contributor, Author) commented Jan 4, 2025:

/gcbrun

cjac force-pushed the template-gpu-20241219 branch from 763e1ff to 900c10a on January 9, 2025 22:03
@cjac (Contributor, Author) commented:

@dlc - Can I get a review of the templates/ directory in this repository, please? I tried to keep it simple for the initial implementation, but if you have any advice about how we can further reduce duplication, I'd be all ears. I'm thinking about picking up your book and getting into the minutiae, but the PR will be closed far before then, I hope!

cjac force-pushed the template-gpu-20241219 branch from cea2aa3 to 2afff45 on January 10, 2025 03:20
cjac force-pushed the template-gpu-20241219 branch from 2afff45 to aa792c3 on January 10, 2025 03:23
@cjac (Contributor, Author) commented Jan 11, 2025:

In reply to a reviewer's question:

> What changes will be required in the steps to create a cluster with init scripts using templates after this PR?
>
> Currently, the steps include:
> --initialization-actions=gs://goog-dataproc-initialization-actions-${REGION}/spark-rapids/spark-rapids.sh
>
> Referencing the example: Create a Dataproc cluster using T4s.
>
> Update: I just learnt about "[spark-rapids] generate spark-rapids/spark-rapids.sh from template". I assume we programmatically regenerate spark-rapids.sh whenever a change is made to the template.

Those instructions seem right, but it may take less time with the new versions, since much of the work is now cached and, when memory is sufficient, installation uses RAM disks.

Maybe mention that, with new custom images, Secure Boot can be enabled. The new custom image script requires that the Secret Manager API be enabled for the project.

And yes, new actions will be generated from the templates on each (now versioned) release.
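For reference, a cluster-creation sketch matching the T4 example referenced above; the image version and machine shapes are placeholders, and under this PR only the provenance of spark-rapids.sh changes (it becomes a file generated from a template rather than a hand-maintained one).

    REGION=us-central1
    gcloud dataproc clusters create example-rapids-cluster \
      --region="${REGION}" \
      --image-version=2.2-debian12 \
      --worker-machine-type=n1-standard-8 \
      --worker-accelerator=type=nvidia-tesla-t4,count=1 \
      --initialization-actions=gs://goog-dataproc-initialization-actions-${REGION}/spark-rapids/spark-rapids.sh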

cjac force-pushed the template-gpu-20241219 branch from effd8b5 to 374ff96 on January 12, 2025 03:55
cjac force-pushed the template-gpu-20241219 branch from 76a6df6 to 0c3eb51 on January 20, 2025 00:51
cjac added a commit to LLC-Technologies-Collier/initialization-actions that referenced this pull request Jan 23, 2025
cjac added a commit to LLC-Technologies-Collier/initialization-actions that referenced this pull request Jan 23, 2025
cjac added a commit that referenced this pull request Feb 6, 2025
* [gpu] toward a more consistent driver and CUDA install

gpu/install_gpu_driver.sh
  * exclusively using .run file installation method when available
  * build nccl from source
  * cache build artifacts from kernel driver and nccl
  * Tested more CUDA minor versions
  * gathering CUDA and driver version from URLs if passed
  * Printing warnings when combination provided is known to fail
  * waiting on apt lock when it exists
  * wrapping expensive functions in completion checks to reduce re-run time (see the sketch after this list)
  * fixed a problem with ops agent not installing ; using venv
  * Installing gcc-12 on ubuntu22 to fix kernel driver FTBFS
  * setting better spark defaults
  * skipping proxy setup if http-proxy metadata not set
  * added function to check secure-boot and os version compatibility
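Below is a minimal sketch of the completion-check and apt-lock-waiting patterns the bullets above describe; the directory, function names, and the NCCL example are illustrative, not the script's actual identifiers.

    # Illustrative sketch only; install_gpu_driver.sh uses its own names and paths.
    readonly COMPLETION_DIR=/var/lib/dataproc/completed   # hypothetical common directory

    function is_complete()   { [[ -f "${COMPLETION_DIR}/$1" ]] ; }
    function mark_complete() { mkdir -p "${COMPLETION_DIR}" && touch "${COMPLETION_DIR}/$1" ; }

    function wait_for_apt_lock() {
      # Block while another process holds the dpkg frontend lock.
      while fuser /var/lib/dpkg/lock-frontend >/dev/null 2>&1 ; do sleep 5 ; done
    }

    function build_nccl() {
      is_complete nccl-build && return 0   # expensive work is skipped on re-run
      wait_for_apt_lock
      apt-get install -y build-essential
      # ... fetch sources, build, and push artifacts to the GCS cache ...
      mark_complete nccl-build
    }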

gpu/manual-test-runner.sh
  * order commands correctly

gpu/test_gpu.py
  * clearer test skipping logic
  * added instructions on how to test pyspark

* correcting driver for cuda 12.4

* correcting cuda subversion.  12.4.0 instead of 12.4.1 so that driver and cuda match up

* corrected canonical 11.8 driver version ; removed extra code and comment ; added better description of what is in the runfile

* skipping most tests ; using 11.7 from the cuda 11 line instead of the less well supported 11.8

* verified that the cuda and driver versions match up

* reducing log capture

* temporarily increasing machine shape for build caching

* 64 is too many for a single T4

* added a subversion for 11.7

* add more tests to the install function

* only including architectures supported by this version of CUDA

* pinning down versions better ; more caching ; more ram disks ; new pytorch and tensorflow test functions

* using maximum from 8.9 series on rocky for 11.7

* skip full build

* pinning to bazel-7.4.0

* NCCL requires gcc-11 for cuda11

* rocky8 is now building from the source in the .run file

* reverting to previous state of only selecting a compiler version on latest releases

* replaced literal path names with variable values ; indexing builds by the signing key used

* moved variable definition to prepare function ; moved driver signing to build phase

* test whether variable is defined before checking its value

* cache only the bins and logs

* build index of kernel modules after unpacking ; remove call to non-existent function

* only build module dependency index once

* skipping CUDA 11 NCCL build on debian12

* skip cuda11 on debian12, rocky9

* renamed verify_pyspark to verify_instance_pyspark

* failing somewhat gracefully ; skipping tests that would fail

* skipping single node tests for rocky8

* re-enable other tests

* Specifying bazel version with variable

* fixing up some skip logic

* replaced OS_NAME with _shortname

* skip more single instance tests for rocky8

* fixing indentation ; skipping redundant test

* remove retries of flakey tests

* oops ; need to define the cuda version to test for

* passing -q to gcloud to generate empty passphrase if no ssh key exists ; selecting a more modern version of the 550 driver

* including instructions on how to create a secure-boot key pair

* -e for expert, not -p for pro

* updated 11.8 and 12.0 driver versions

* added a signature check test which allows granular selection of platform to test, but does not yet verify signatures

* tuning the layout of arguments to userspace.run

* scoping DEFAULT_CUDA_VERSION correctly ; exercising rocky including kerberos on 12.6

* add a connect timeout to the ssh call instead of trying to patch around a longer than expected connection delay

* add some entropy to the process

* perhaps a re-run would have fixed 2.0-rocky8 on that last run

* increasing init action timeout to account for uncached builds

* cache non-open kernel build results

* per-kernel sub-directory for kmod tarballs

* using upstream repo and branch

* corrected grammar error

* testing Kerberos some more

* better implementation of numa node selection

* this time with a test which is exercised

* skip debian11 on Kerberos

* also skipping 2.1-ubuntu20 on kerberos clusters

* re-adjusting tests to be performed ; adjusting rather than skipping known failure cases

* more temporal variance

* skipping CUDA=12.0 for ubuntu22

* kerberos not known to succeed on 2.0-rocky8

* 2.2 dataproc images do not support CUDA <= 12.0

* skipping SINGLE configuration for rocky8 again

* not testing 2.0

* trying without test retries ; retries should happen within the test, not by re-running the test

* kerberos only works on 2.2

* using expectedFailure instead of skipTest for tests which are known to fail

* document one of the failure states

* skipping expected failures

* updated manual-test-runner.sh instructions

* this one generated from template after refactor

* do not point to local rpm pgp key

* re-ordering to reduce delta from master

* custom image usage can come later

* see #1283

* replaced incorrectly removed presubmit.sh and removed custom image key creation script intended to be removed in 70f37b6

* revert nearly to master

* can include extended test suite later

* order commands correctly

* placing all completion files in a common directory

* extend supported version list to include latest release of each minor version and their associated driver

* tested with CUDA 11.6.2/510.108.03

* nccl build completes successfully on debian10

* account for nvidia-smi ABI change post 11.6

* exercised with cuda 11.1

* cleaned up nccl build and pack code a bit
* no longer installing cudnn from local debian repo
* unpacking nccl from cache immediately rather than waiting until
  later in the code
* determine cudnn version by what is available in the repo
* less noise from apt-mark hold
* nccl build tested on 11.1 and 11.6
* account for abi change in nvidia-smi

* reverting cloudbuild/Dockerfile to master

* nvidia is 404ing for download.nvidia.com ; using us.download.nvidia.com

* skipping rocky9

* * adding version 12.6 to the support matrix
* changing layout of gcs package folder
* install_pytorch function created and called when cuDNN is being installed

* incorrect version check removed

* only install pytorch if include-pytorch metadata set to true

* since call to install_pytorch is protected by metadata check, skip metadata check within the function ; create new function harden_sshd_config and call it

* increasing timeout and machine shape to reduce no-cache build time

* skip full test run due to edits to integration_tests directory

* ubuntu18 does not know about kex-gss ; use correct driver version number for cuda 11.1.1 url generation

* on rocky9 sshd service is called sshd instead of ssh as the rest of the platforms call it

* kex-gss is new in debian11

* all rocky call it sshd it seems

* cudnn no longer available on debian10

* compared with #1282 ; this change achieves closer parity

* slightly better variable declaration ordering ; it is better still in the templates/ directory from #1282

* install spark rapids

* cache the results of nvidia-smi --query-gpu

* reduce development time

* exercising more CUDA variants ; testing whether tests fail on long runs

* try to reduce concurrent builds ; extend build time further ; only enable spark rapids on images >= 2.1

* fixed bug with spark rapids version assignment ; more conservative about requirements for ramdisk ; roll back spark.SQLPlugin change

* * gpu does not work on capacity scheduler on dataproc 2.0 ; use fair
* protect against race condition on removing the .building files
* add logic for pre-11.7 cuda package repo back in
* clean up and verify yarn config

* revert test_install_gpu_cuda_nvidia_with_spark_job cuda versions

* configure for use with JupyterLab

* 2.2 should use 12.6.3 (latest)

* Addressing review from cnauroth

gpu/install_gpu_driver.sh:
* use the same retry arguments in all calls to curl
* correct 12.3's driver and sub-version
* improve logic for pause as other workers perform build
* remove call to undefined clear_nvsmi_cache
* move closing "fi" to line of its own
* added comments for unclear logic
* removed commented code
* remove unused curl for latest driver version

gpu/test_gpu.py
* removed excess test
* added comment about numa node selection
* removed skips of rocky9 ; 2.2.44-rocky9 build succeeds

* reverting changes to presubmit.sh
@cjac (Contributor, Author) commented Feb 6, 2025:

TODO: apply these changes from gpu driver installer into common templates:

e56ddd0

cjac marked this pull request as draft on February 6, 2025 19:57
Successfully merging this pull request may close these issues.

[initialization-actions] The repository has manually-generated, re-used code which gets out of sync