Conversation

QuanMPhm (Collaborator) commented Feb 7, 2025

Closes #97. There are a few questions to be answered below before I mark this PR as ready for review.

QuanMPhm requested a review from naved001 February 7, 2025 09:58
SU_A100_GPU: {"gpu": 1, "cpu": 24, "ram": 74},
SU_A100_SXM4_GPU: {"gpu": 1, "cpu": 32, "ram": 245},
SU_V100_GPU: {"gpu": 1, "cpu": 24, "ram": 192},
SU_H100_GPU: {"gpu": 1, "cpu": 64, "ram": 384},
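
For context, here is one plausible way a table like this gets consumed when billing a pod (a minimal sketch; su_count is a hypothetical helper, not necessarily how invoice.py does it): the pod is charged for as many SUs as its most demanding resource dimension requires.

import math

# SU shape as proposed above for the H100 (values from this diff).
SU_H100_GPU = {"gpu": 1, "cpu": 64, "ram": 384}

def su_count(shape, gpu, cpu, ram_gib):
    # Most demanding dimension relative to the SU shape, rounded up.
    return math.ceil(max(gpu / shape["gpu"],
                         cpu / shape["cpu"],
                         ram_gib / shape["ram"]))

# A pod asking for 1 GPU but 128 CPUs would count as 2 H100 SUs.
print(su_count(SU_H100_GPU, gpu=1, cpu=128, ram_gib=200))  # -> 2
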
Collaborator

@joachimweyl I just logged into the H100 machine and noticed that it has 256 physical cores, or 512 threads, of which 508 are allocatable. So the SU could give out far more CPUs.

Contributor

@hakasapl is there an easy way to check all of the H100s and confirm that they have 512, not 256, vCPUs?
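
One way to check all nodes at once, rather than logging into each machine (a sketch using the official Python kubernetes client; the label selector is a placeholder, substitute whatever actually identifies the H100 nodes):

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Label selector is an assumption; use whatever matches the H100 nodes.
for node in v1.list_node(label_selector="nvidia.com/gpu.product=H100").items:
    print(node.metadata.name,
          "capacity:", node.status.capacity["cpu"],        # e.g. 512
          "allocatable:", node.status.allocatable["cpu"])  # e.g. 508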

Collaborator

I would be surprised if these machines didn't have the exact same configuration: 256 physical cores, each with 2 hardware threads, so they show up as 512 vCPUs.

Contributor

@naved001 is the RAM correct?


Yes, the config is the same for all of them. The RAM is 1.5 TB.
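
That figure lines up with the ram value initially proposed above (simple arithmetic; assumes the 1.5 TB is 1536 GiB in binary units and is split evenly across the node's 4 GPUs):

ram_gib = 1536                  # 1.5 TiB reported per node (assumed binary units)
gpus_per_node = 4               # four H100s per node, per the scheduling test below
print(ram_gib / gpus_per_node)  # -> 384.0 GiB, the initial H100 SU ram above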

Contributor

Also, can we have all of these pull from one central location? That is why we updated the nerc-doc to pull from one central location.

Collaborator

@joachimweyl I decided to run some tests to see if I could fit 4 giant pods of these theoretical sizes. With 127 CPUs per pod (508/4), I could only fit 3 pods; the 4th pod didn't schedule because the node was left with only 126.016 CPUs.

I then set the CPUs per pod to 126 and the memory to 375 GiB. With that I could successfully launch 4 pods, and the node was completely utilized.

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                       Requests         Limits
  --------                       --------         ------
  cpu                            504874m (99%)    508010m (100%)
  memory                         1539760Mi (99%)  1536500Mi (99%)
  ephemeral-storage              0 (0%)           0 (0%)
  hugepages-1Gi                  0 (0%)           0 (0%)
  hugepages-2Mi                  0 (0%)           0 (0%)
  devices.kubevirt.io/kvm        0                0
  devices.kubevirt.io/tun        0                0
  devices.kubevirt.io/vhost-net  0                0
  nvidia.com/gpu                 4                4

The CPU and memory are 99% utilized.
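
Working backwards from the request totals in that output also shows why the 127-CPU pods could not schedule (plain arithmetic; all values taken from the log above):

allocatable_mcpu = 508010                 # node allocatable CPU (the 100% figure)
pod_requests_mcpu = 4 * 126000            # four pods at 126 CPUs each
system_mcpu = 504874 - pod_requests_mcpu  # 874m requested by infrastructure pods
free_for_pods = allocatable_mcpu - system_mcpu  # 507136m left for our pods
print(free_for_pods / 4)  # -> 126784m per pod: 126 CPUs fit, 127 (127000m) does not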

I think, just to be safe, we should reserve some additional amount beyond what Kubernetes already reserves. I'd say 124 CPUs and 360 GiB memory per pod. That'd leave 12 CPUs and 60 GiB memory for our random little infrastructure things (@jtriley thoughts?)
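
For reference, the headroom that suggestion buys relative to the fully packed 126 CPU / 375 GiB configuration (plain arithmetic):

pods = 4
cpu_headroom = 508 - pods * 124        # 12 CPUs: 8 freed by shrinking, 4 already spare
ram_headroom_gib = pods * (375 - 360)  # 60 GiB freed by shrinking the memory request
print(cpu_headroom, ram_headroom_gib)  # -> 12 60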

I don't know if anybody did this sort of testing before we decided the size for A100.

Contributor

Thank you for doing that testing; that will save us some grief in the future.

Collaborator

Why are these numbers different from the numbers here? For example, the A100 SXM4 says 32 CPUs, but on the pricing page it is 31, and the RAM says 245 vs. 240.

@joachimweyl Because this PR needs to be updated to include the changes from main; the size of the A100 now matches what's in the docs: https://github.com/CCI-MOC/openshift-usage-scripts/blob/main/openshift_metrics/invoice.py#L80

Collaborator Author

I will amend the SU info for the H100s as well. Hopefully all of this will be resolved once #112 is complete.

QuanMPhm marked this pull request as ready for review March 13, 2025 18:27
QuanMPhm (Collaborator Author)

I have rebased the PR and removed all TODOs. The PR is now ready for... review? Undrafting? Submission? Merging? I'm not sure what to call it.

SU_A100_GPU: {"gpu": 1, "cpu": 24, "ram": 74},
SU_A100_SXM4_GPU: {"gpu": 1, "cpu": 31, "ram": 240},
SU_V100_GPU: {"gpu": 1, "cpu": 48, "ram": 192},
SU_H100_GPU: {"gpu": 1, "cpu": 63, "ram": 376},
Collaborator

@QuanMPhm could you update the CPU to 124 and the RAM to 360 here? Ultimately we'll get this from nerc-rates, but I don't want to block this PR on that. And while we are waiting to implement #112, I still want to have some reasonable defaults.
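
With that change, the H100 entry would presumably read:

SU_H100_GPU: {"gpu": 1, "cpu": 124, "ram": 360},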

Collaborator

@joachimweyl Do we have agreement on the final CPU and RAM numbers for the H100 SU?

Contributor

We just need to make sure the nerc-doc shows the correct numbers by the time we have H100s available on prod.

QuanMPhm (Collaborator Author) commented Mar 17, 2025

@naved001 I have amended the SU values for the H100s and fixed the unit test so that it makes the maximum request for an H100 SU.
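
A sketch of what such a test might assert (reusing the hypothetical su_count helper from earlier; the real test suite's helpers may differ): a pod requesting the full SU shape should map to exactly one H100 SU, and anything more should spill into a second.

import math

SU_H100_GPU = {"gpu": 1, "cpu": 124, "ram": 360}

def su_count(shape, gpu, cpu, ram_gib):
    # Hypothetical: SUs consumed = most demanding dimension, rounded up.
    return math.ceil(max(gpu / shape["gpu"],
                         cpu / shape["cpu"],
                         ram_gib / shape["ram"]))

def test_h100_max_request_is_one_su():
    # The maximum request that still fits in a single H100 SU.
    assert su_count(SU_H100_GPU, gpu=1, cpu=124, ram_gib=360) == 1
    # One more CPU tips the pod into a second SU.
    assert su_count(SU_H100_GPU, gpu=1, cpu=125, ram_gib=360) == 2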

QuanMPhm requested a review from naved001 March 17, 2025 14:40
naved001 merged commit f93520f into CCI-MOC:main Mar 17, 2025
2 checks passed
QuanMPhm added a commit to QuanMPhm/nerc-rates that referenced this pull request Apr 1, 2025
Thanks to Naved for doing the testing to realize
these values would be better for the H100 SU [1].
It was found that an H100 could fully utilize up to
124 CPUs.

[1] CCI-MOC/openshift-usage-scripts#98 (comment)
QuanMPhm added a commit to QuanMPhm/nerc-rates that referenced this pull request Apr 2, 2025
jimmysway pushed a commit to jimmysway/nerc-rates that referenced this pull request Sep 15, 2025