update rapids version for 24.10 release #1248

Merged
merged 10 commits into from
Dec 25, 2024

Conversation

nvliyuan
Contributor

This PR updates the spark-rapids script version to 24.10.0 and updates the README.

@nvliyuan
Contributor Author

@viadea please help review. CC @jayadeep-jayaraman @cjac

@cjac
Contributor

cjac commented Oct 24, 2024

Oh hey, thanks for the ping. I'll check it out.

@cjac
Contributor

cjac commented Oct 24, 2024

/gcbrun

cjac marked this pull request as draft October 24, 2024 04:26
Contributor

cjac left a comment

Let's get the docs updated to reflect the versions of Dataproc supported in Q4 2024

Our current supported versions follow:
2.2-debian12
2.1-debian11
2.0-debian10
2.2-ubuntu22
2.2-ubuntu20
2.0-ubuntu18
2.2-rocky9
2.1-rocky8
2.0-rocky8

If spark-rapids does not support all of these platforms, then we'll need to merge in my changes to dask-rapids.

  * NCCL 2.11.4+
- * Ubuntu 18.04, Ubuntu 20.04 or Rocky Linux 7, Rocky Linux8, Debian 10, Debian 11
+ * Ubuntu 20.04, Ubuntu 22.04, CentOS 7, or Rocky Linux 8, Debian 10, Debian 11
Contributor

Do you really still support CentOS 7? I commend you. Do the tests exercise that platform?

Contributor

I would also add Rocky Linux 9.

If you do not support Rocky Linux 9 yet, we can merge my work from spark-dask. I'll continue my review.

Contributor Author

Thanks for pointing that out. We no longer support CentOS 7; I've updated the doc with a link to the software and hardware requirements.

Contributor Author

Filed a related PR for the doc issue.

@@ -63,7 +63,7 @@ export CUDA_VER=11.5

  gcloud dataproc clusters create $CLUSTER_NAME \
      --region $REGION \
-     --image-version=2.0-ubuntu18 \
+     --image-version=2.1-ubuntu20 \
Contributor

Please recommend 2.2 where possible. If 2.1 doesn't have other representation, then this is fine, but we should be emphasizing that 2.2 is the better choice.

Contributor Author

Updated to the 2.2 image.
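
For reference, here is a minimal sketch of what defaulting the cluster-create example to a 2.2 image could look like. The helper name print_create_cmd, the placeholder defaults for CLUSTER_NAME and REGION, and the choice of 2.2-ubuntu22 (taken from the supported-versions list earlier in this review) are illustrative assumptions, and the README's other flags are omitted. The function only prints the command; it does not call gcloud.

```shell
# Hypothetical helper: print (do not run) a cluster-create command that
# defaults to a Dataproc 2.2 image. Other README flags are omitted here.
print_create_cmd() {
  # $1 optionally overrides the assumed default image version.
  echo "gcloud dataproc clusters create ${CLUSTER_NAME:-example-cluster}" \
       "--region ${REGION:-us-central1}" \
       "--image-version=${1:-2.2-ubuntu22}"
}

print_create_cmd              # uses the assumed 2.2 default
print_create_cmd 2.2-rocky9   # override with another image from the list
```

Keeping the image version as a single overridable argument makes it easy to swap in another image from the list without touching the rest of the command.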

@nvliyuan
Contributor Author

nvliyuan commented Nov 5, 2024

Hi @cjac , can we merge this pr?

@nvliyuan
Contributor Author

Hi @cjac , any update?

@cjac
Contributor

cjac commented Nov 11, 2024

I apologize for the delay here.

I'm tied up adding installation from local disk as an option to rapids/rapids.sh. I had begun seeing weekly CDN-related build failures, so I'm bringing the packages closer to the cluster to improve CI/CD test performance.

Unfortunately, conda does not presently install directly from direct attached media, opting instead to copy the packages to an intermediate temp directory before unpacking.

conda/conda#14377

If spark-rapids is urgent enough to merit setting dask-rapids aside now, rather than finishing it first and then moving on to spark-rapids, I may be able to switch context. I prefer to finish the other first, but if nv wants to see a new version of spark-rapids before the middle of December, then let me know and I'll switch tracks for a bit.

My current estimate for completion of dask-rapids work is later this week. Then I will take a look at spark-rapids/ for the first time since it got its own directory.

C.J.

@nvliyuan
Contributor Author

Hi @cjac, can we merge this PR now?

@cjac
Contributor

cjac commented Dec 24, 2024

I haven't tested it thoroughly yet. I've been caught up in refactoring shared code into templates. I would like to generate this file from components rather than copy/pasting between scripts.

Can you let me know what you think of #1282 please?

@nvliyuan
Contributor Author

Let's merge the pr for now, it is just a version update, thx

@cjac
Contributor

cjac commented Dec 25, 2024

let me try it in my environment...

@cjac
Contributor

cjac commented Dec 25, 2024

@cjac
Contributor

cjac commented Dec 25, 2024

Incorrectly specifying rapids-runtime as --metadata rapids-runtime="RAPIDS" produces a usable error message.
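
A rough sketch of the kind of check that yields such a message. This is not the code in spark-rapids.sh: the function name check_rapids_runtime, the assumption that SPARK is the only value this script accepts, and the message wording are all illustrative.

```shell
# Hypothetical validation of the rapids-runtime metadata value.
# Assumption: only SPARK is accepted; anything else is rejected with a
# message on stderr that names the offending value.
check_rapids_runtime() {
  case "$1" in
    SPARK) return 0 ;;
    *)
      echo "Unsupported rapids-runtime: '$1' (expected SPARK)" >&2
      return 1
      ;;
  esac
}

check_rapids_runtime "SPARK" && echo "runtime ok"
check_rapids_runtime "RAPIDS" || echo "rejected, as intended"
```

Failing early with the bad value in the message is what makes the error usable: the operator can see which metadata key to fix without reading the script.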

@cjac
Contributor

cjac commented Dec 25, 2024

@nvliyuan - do you mind if I commit to your branch?

@nvliyuan
Contributor Author

please feel free to commit, thanks

@cjac
Contributor

cjac commented Dec 25, 2024

a re-re-re-run took 4m35.536s to complete ; This looks good to me. Let me run it through the automated tests.

It looks like it built the kernel more than once.

NG code in templates/spark-rapids/ [1] caches builds to GCS after the first run completes so subsequent similar runs will have less work to do.

[1] https://github.com/LLC-Technologies-Collier/initialization-actions/tree/template-gpu-20241219/templates/spark-rapids

@cjac
Contributor

cjac commented Dec 25, 2024

/gcbrun

@cjac
Contributor

cjac commented Dec 25, 2024

there's a known problem with our build system. Un momento por favor.

@cjac
Contributor

cjac commented Dec 25, 2024

/gcbrun

@cjac
Contributor

cjac commented Dec 25, 2024

/gcbrun

@cjac
Contributor

cjac commented Dec 25, 2024

failure on 2.2-rocky9 ; I'll spin that up in my env.

@cjac
Contributor

cjac commented Dec 25, 2024

Oof. I forgot DKMS took so long to run. I think the test may be timing out while waiting on dnf -y -q module install nvidia-driver:latest-dkms.

@cjac
Contributor

cjac commented Dec 25, 2024

dnf -y -q install cuda-toolkit is taking a long time, too

@cjac
Contributor

cjac commented Dec 25, 2024

the run takes 14m9.444s on rocky9

# echo $?
0

@cjac
Contributor

cjac commented Dec 25, 2024

/gcbrun

@cjac
Contributor

cjac commented Dec 25, 2024

[edited to add: I was incorrect to assume that nvliyuan/initialization-actions' master tracks GoogleCloudDataproc/initialization-actions' master]

I'm sorry, I seem to have done something to the commit history here. The diffstat looks very wrong at this point.

$ git diff master | diffstat
 CONTRIBUTING.md                         |   20 ++---
 cloudbuild/Dockerfile                   |   22 +++++-
 cloudbuild/presubmit.sh                 |    1 
 cloudbuild/run-presubmit-on-k8s.sh      |   34 +++++++--
 dask/dask.sh                            |  213 ++++++++++++++++++++++++++++++++++++++++++++--------------
 dask/test_dask.py                       |   14 +++
 gpu/Dockerfile                          |   40 +++++++++++
 gpu/README.md                           |   28 +++----
 gpu/bazel.screenrc                      |   11 +++
 gpu/env.json.sample                     |    7 +
 gpu/install_gpu_driver.sh               |  652 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------------------------
 gpu/manual-test-runner.sh               |   77 +++++++++++++++++++++
 gpu/run-bazel-tests.sh                  |   24 ++++++
 gpu/test_gpu.py                         |  167 ++++++++++++++++++++++++++++++++-------------
 h2o/sample-script.py                    |   11 ---
 horovod/horovod.sh                      |    4 -
 horovod/test_horovod.py                 |   13 ++-
 hue/README.md                           |  118 ++++++++++++++++++++++++++++++++
 hue/another-query.png                   |binary
 hue/create-hive-table.png               |binary
 hue/hue-ui.png                          |binary
 hue/simple-hiveql.png                   |binary
 integration_tests/dataproc_test_case.py |   18 +++-
 rapids/BUILD                            |    2 
 rapids/Dockerfile                       |   40 +++++++++++
 rapids/bazel.screenrc                   |   17 ++++
 rapids/env.json.sample                  |    7 +
 rapids/manual-test-runner.sh            |   77 +++++++++++++++++++++
 rapids/rapids.sh                        |  814 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------------------------------------------------
 rapids/run-bazel-tests.sh               |   23 ++++++
 rapids/test_rapids.py                   |  137 +++++++++++--------------------------
 rapids/verify_rapids_dask.py            |   19 -----
 rapids/verify_rapids_dask_yarn.py       |   19 +++++
 spark-rapids/README.md                  |   19 -----
 spark-rapids/spark-rapids.sh            |   14 ++-
 35 files changed, 1943 insertions(+), 719 deletions(-)

@cjac
Contributor

cjac commented Dec 25, 2024

My apologies; my use of 'master' here was ambiguous. When diffed against origin master's commit, 169e98e, I see what I expect.
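
One way to avoid this kind of ambiguity is to diff only against an explicit base, such as a remote-tracking ref or a commit hash. The sketch below is illustrative: is_qualified_ref and diffstat_against are hypothetical helper names, and the rule that a base must contain a slash or look like a 7 to 40 character hex hash is an assumption made for this example, not a git convention.

```shell
# Hypothetical guard: accept only remote-qualified refs (origin/master)
# or abbreviated/full commit hashes (169e98e); reject bare branch names.
is_qualified_ref() {
  case "$1" in
    */*) return 0 ;;
  esac
  printf '%s\n' "$1" | grep -Eq '^[0-9a-f]{7,40}$'
}

# Print a diffstat of the working tree against an explicit base.
diffstat_against() {
  if ! is_qualified_ref "$1"; then
    echo "refusing ambiguous base '$1'; use e.g. origin/master or a commit hash" >&2
    return 2
  fi
  git diff --stat "$1"
}
```

With these helpers, diffstat_against origin/master or diffstat_against 169e98e would run the diff, while diffstat_against master is refused before git is invoked.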

cjac force-pushed the rapids-v2410 branch 2 times, most recently from a3e5e99 to ab83665 December 25, 2024 04:16
@cjac
Contributor

cjac commented Dec 25, 2024

/gcbrun

@cjac
Contributor

cjac commented Dec 25, 2024

many tests have passed. still standing by for full green run.

@cjac
Contributor

cjac commented Dec 25, 2024

okay, that looks good.

cjac self-requested a review December 25, 2024 04:53
Contributor

cjac left a comment

LGTM.

cjac marked this pull request as ready for review December 25, 2024 04:56
cjac merged commit 8089389 into GoogleCloudDataproc:master Dec 25, 2024
1 of 2 checks passed