update rapids version for 24.10 release #1248
Conversation
@viadea please help review. CC @jayadeep-jayaraman @cjac
Oh hey, thanks for the ping. I'll check it out.
/gcbrun
Let's get the docs updated to reflect the versions of Dataproc supported in Q4 2024
Our current supported versions follow:
2.2-debian12
2.1-debian11
2.0-debian10
2.2-ubuntu22
2.2-ubuntu20
2.0-ubuntu18
2.2-rocky9
2.1-rocky8
2.0-rocky8
And if all of these platforms are not supported by spark-rapids, then we'll need to merge in my changes to dask-rapids.
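As a hypothetical illustration only (the array name and the loop are mine; the image names are simply the list above), that support matrix could be consumed from a shell script, e.g. to parameterize per-image integration tests:

# Hypothetical sketch: the Q4 2024 Dataproc support matrix as a shell array,
# e.g. for driving per-image test runs. The image names come from the list
# above; the variable name and the loop body are illustrative only.
SUPPORTED_IMAGE_VERSIONS=(
  2.2-debian12 2.1-debian11 2.0-debian10
  2.2-ubuntu22 2.2-ubuntu20 2.0-ubuntu18
  2.2-rocky9   2.1-rocky8   2.0-rocky8
)

for image in "${SUPPORTED_IMAGE_VERSIONS[@]}"; do
  echo "would exercise spark-rapids on Dataproc image ${image}"
done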
spark-rapids/README.md (outdated diff)
 * NCCL 2.11.4+
-* Ubuntu 18.04, Ubuntu 20.04 or Rocky Linux 7, Rocky Linux8, Debian 10, Debian 11
+* Ubuntu 20.04, Ubuntu 22.04, CentOS 7, or Rocky Linux 8, Debian 10, Debian 11
Do you have support for CentOS 7, still, really? I commend you. Do the tests exercise that platform?
I would also add Rocky Linux 9.
If you do not have support for Rocky Linux 9 yet, we can merge my work from spark-dask. I'll continue my review.
Thanks for pointing it out. We don't support CentOS 7 now; I'll update the link in the doc for the software/hardware requirements.
Filed a related PR for the doc issue.
spark-rapids/README.md (outdated diff)
@@ -63,7 +63,7 @@ export CUDA_VER=11.5

 gcloud dataproc clusters create $CLUSTER_NAME \
     --region $REGION \
-    --image-version=2.0-ubuntu18 \
+    --image-version=2.1-ubuntu20 \
Please recommend 2.2 where possible. If 2.1 doesn't have other representation, then this is fine, but we should be emphasizing that 2.2 is the better choice.
Updated to the 2.2 image.
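To make the 2.2 recommendation concrete, here is a hedged sketch of a cluster-create command against a 2.2 image with the spark-rapids initialization action; the image name (2.2-ubuntu22), the machine and GPU types, and the rapids-runtime metadata value are assumptions, not necessarily what the README ends up recommending:

# Hedged sketch: create a Dataproc cluster on a 2.2 image with the
# spark-rapids initialization action. CLUSTER_NAME, REGION, the machine/GPU
# choices, and the rapids-runtime value are placeholders/assumptions.
export CLUSTER_NAME=my-rapids-cluster
export REGION=us-central1

gcloud dataproc clusters create "${CLUSTER_NAME}" \
    --region "${REGION}" \
    --image-version=2.2-ubuntu22 \
    --master-machine-type=n1-standard-8 \
    --worker-machine-type=n1-standard-8 \
    --worker-accelerator=type=nvidia-tesla-t4,count=1 \
    --initialization-actions="gs://goog-dataproc-initialization-actions-${REGION}/spark-rapids/spark-rapids.sh" \
    --metadata=rapids-runtime=SPARK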
Hi @cjac, can we merge this PR?
Hi @cjac, any update?
I apologize for the delay here. I'm caught up in adding installation from local disk as an option to rapids/rapids.sh; I had begun seeing weekly CDN-related build failures, so I'm bringing the packages closer to the cluster to improve CI/CD test performance. Unfortunately, conda does not presently install directly from direct-attached media, opting instead to copy the packages to an intermediate temp directory before unpacking.

If movement on spark-rapids is urgent enough to merit putting down the dask-rapids work before it's finished and moving on to spark-rapids, I may be able to switch context. I prefer to finish the other first, but if NV wants to see a new version of spark-rapids before the middle of December, then let me know and I'll switch tracks for a bit.

My current estimate for completion of the dask-rapids work is later this week. Then I will take a look at spark-rapids/ for the first time since it got its own directory.

C.J.
Hi @cjac, not sure about the status here; can we merge this PR now?
I haven't tested it thoroughly yet. I've been caught up in refactoring shared code into templates. I would like to generate this file from components rather than copy/pasting between scripts. Can you let me know what you think of #1282 please?
Let's merge the PR for now; it is just a version update. Thanks!
Let me try it in my environment...
Running without cuda-version specified produces a request for:
Incorrectly specifying rapids-runtime as
@nvliyuan - do you mind if I commit to your branch?
Please feel free to commit, thanks!
A re-re-re-run took 4m35.536s to complete; this looks good to me. Let me run it through the automated tests. It looks like it built the kernel more than once. NG code in templates/spark-rapids/ caches builds to GCS after the first run completes, so subsequent similar runs will have less work to do.
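A minimal sketch of that caching pattern, assuming a hypothetical bucket path and module layout rather than the repo's actual template code:

# Hedged sketch of "cache the kernel-module build to GCS" as described above.
# The bucket, tarball name, and module directory are assumptions.
CACHE_URI="gs://my-build-cache/spark-rapids/nvidia-kmod-$(uname -r).tar.gz"

if gsutil -q stat "${CACHE_URI}"; then
  # A previous run on this kernel already built the module: reuse it.
  gsutil cp "${CACHE_URI}" /tmp/nvidia-kmod.tar.gz
  tar -C / -xzf /tmp/nvidia-kmod.tar.gz
else
  # First run on this kernel: build via DKMS, then publish to the cache so
  # that subsequent similar runs have less work to do.
  dkms autoinstall
  tar -C / -czf /tmp/nvidia-kmod.tar.gz "lib/modules/$(uname -r)/updates/dkms"
  gsutil cp /tmp/nvidia-kmod.tar.gz "${CACHE_URI}"
fi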
/gcbrun
There's a known problem with our build system. One moment, please.
/gcbrun
/gcbrun
Failure on 2.2-rocky9; I'll spin that up in my env.
Oof. I forgot DKMS took so long to run. I think the test is timing out waiting on
The run takes 14m9.444s on rocky9.
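One hedged way to accommodate that, sketched under the assumption of a 20-minute budget and the usual dkms status output format (not taken from the actual test harness):

# Hedged sketch: poll until the nvidia DKMS module reports "installed",
# instead of letting the caller time out. Budget and module name are assumed.
timeout=1200   # seconds; the rocky9 run above took ~14 minutes, so leave headroom
elapsed=0
until dkms status 2>/dev/null | grep -q 'nvidia.*installed'; do
  if (( elapsed >= timeout )); then
    echo "timed out waiting for the nvidia DKMS module" >&2
    exit 1
  fi
  sleep 15
  (( elapsed += 15 ))
done
echo "nvidia DKMS module installed after ${elapsed}s"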
/gcbrun
[Edited to add: I was incorrect to assume that nvliyuan/initialization-actions' master tracks GoogleCloudDataproc/initialization-actions' master.] I'm sorry, I seem to have done something to the commit history here. The diffstat looks very wrong at this point.
My apologies; ambiguous use of 'master' here. When diffed against origin master's commit, 169e98e, I see what I expect.
Signed-off-by: liyuan <[email protected]>
liyuan force-pushed the branch from a3e5e99 to ab83665
/gcbrun
Many tests have passed; still standing by for a full green run.
Okay, that looks good.
LGTM.
This PR updates the spark-rapids script version to 24.10.0 and updates the README doc.