Allow billing of H100 GPUs #98
Conversation
openshift_metrics/invoice.py (Outdated)
SU_A100_GPU: {"gpu": 1, "cpu": 24, "ram": 74},
SU_A100_SXM4_GPU: {"gpu": 1, "cpu": 32, "ram": 245},
SU_V100_GPU: {"gpu": 1, "cpu": 24, "ram": 192},
SU_H100_GPU: {"gpu": 1, "cpu": 64, "ram": 384},
@joachimweyl I just logged into the H100 machine and noticed that it has 256 physical cores, or 512 threads, of which 508 are allocatable. So the SU could give out way more CPUs.
@hakasapl is there an easy way to check all of the H100s and confirm that they have 512, not 256, vCPUs?
I would be surprised if these machines didn't have the exact same configuration: 256 physical cores, each with 2 hardware threads, so they show up as 512 vCPUs.
@naved001 is the RAM correct?
The config is the same for all of them, yes. The RAM is 1.5 TB.
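A minimal sketch of one way to confirm this across every GPU node straight from the cluster, assuming the official kubernetes Python client and a working kubeconfig (the nvidia.com/gpu filter simply picks out nodes advertising GPUs via the NVIDIA device plugin; adjust if the H100 nodes are labelled differently):

# Sketch: print capacity vs allocatable CPU/memory for every GPU node.
# Assumes the `kubernetes` Python client and cluster credentials.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    cap = node.status.capacity
    alloc = node.status.allocatable
    if "nvidia.com/gpu" in cap:
        print(
            node.metadata.name,
            "cpu", cap["cpu"], "->", alloc["cpu"], "allocatable;",
            "memory", cap["memory"], "->", alloc["memory"], "allocatable;",
            "gpus", cap["nvidia.com/gpu"],
        )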
Also, can we have all of these values pull from one central location? That is why we updated the nerc-doc to pull from a single source.
@joachimweyl I decided to do some tests to see if I could fit 4 giant pods of these theoretical sizes. With 127 CPUs per pod (508/4) I could only fit 3 pods; the 4th pod didn't schedule because the node was left with only 126.016 CPUs.
I then set the CPU per pod to 126 and the memory to 375 GiB. With that I could successfully launch 4 pods and the node was completely utilized.
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 504874m (99%) 508010m (100%)
memory 1539760Mi (99%) 1536500Mi (99%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
devices.kubevirt.io/kvm 0 0
devices.kubevirt.io/tun 0 0
devices.kubevirt.io/vhost-net 0 0
nvidia.com/gpu 4 4
The CPU and memory are 99% utilized.
I think, just to be safe, we should reserve some additional amount beyond what Kubernetes already reserves. I'd say 124 CPUs and 360 GiB of memory per pod. That would leave 12 CPUs and 60 GiB of memory for our random little infrastructure things (@jtriley, thoughts?)
I don't know if anybody did this sort of testing before we decided on the size for the A100.
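For reference, a small sketch of the sizing arithmetic above. The allocatable figures are read off the test output earlier in this thread; the per-pod reserve (124 CPUs / 360 GiB) is only the proposal under discussion, not a settled number:

# Sketch of the H100 node sizing arithmetic discussed above.
ALLOCATABLE_CPU = 508        # 512 hardware threads minus what Kubernetes reserves
ALLOCATABLE_MEM_GIB = 1500   # roughly 1.5 TB minus what Kubernetes reserves
GPUS_PER_NODE = 4

naive_cpu_per_pod = ALLOCATABLE_CPU // GPUS_PER_NODE   # 127: the 4th pod would not schedule
packed_cpu_per_pod = 126                               # largest request that packed 4 pods
proposed_cpu_per_pod = 124                             # proposal: keep headroom for infra pods
proposed_mem_per_pod = 360                             # GiB per pod, same proposal

spare_cpu = ALLOCATABLE_CPU - GPUS_PER_NODE * proposed_cpu_per_pod       # 12 CPUs left over
spare_mem = ALLOCATABLE_MEM_GIB - GPUS_PER_NODE * proposed_mem_per_pod   # ~60 GiB left over
print(spare_cpu, spare_mem)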
Thank you for doing that testing; that will save us some grief in the future.
Why are these numbers different from the numbers here? For example, A100SXM4 says 32 CPUs here but 31 on the pricing page, and the RAM says 245 vs 240.
@joachimweyl Because this PR needs to be updated to include the changes from main; the size of the A100 now matches what's in the docs: https://github.com/CCI-MOC/openshift-usage-scripts/blob/main/openshift_metrics/invoice.py#L80
I will amend the SU info for the H100s as well. Hopefully all of this will be resolved after #112 is complete.
Force-pushed from e0b993d to 682b268.
I have rebased the PR and removed all TODOs. The PR is now ready for... review? Undrafted? Ready for submission? Merging? I'm not sure how I should say it.
openshift_metrics/invoice.py (Outdated)
SU_A100_GPU: {"gpu": 1, "cpu": 24, "ram": 74},
SU_A100_SXM4_GPU: {"gpu": 1, "cpu": 31, "ram": 240},
SU_V100_GPU: {"gpu": 1, "cpu": 48, "ram": 192},
SU_H100_GPU: {"gpu": 1, "cpu": 63, "ram": 376},
@joachimweyl Do we have an agreement on the final CPU and RAM numbers for the H100 SU?
We just need to make sure the nerc-doc shows the correct numbers by the time we have H100s available on prod.
Force-pushed from 682b268 to 87bbd83.
@naved001 I have amended the SU values for the H100s and also fixed the unit test so that it makes the maximum request for an H100 SU.
Thanks to Naved for doing the testing to realize these values would be better for the H100 SU [1]. It was found that an H100 could fully utilize up to 124 CPUs. [1] CCI-MOC/openshift-usage-scripts#98 (comment)
Closes #97. There are a few questions to be answered below before I turn this into a PR.