Conversation

QuanMPhm (Collaborator) commented Feb 7, 2025

Closes #97. There are a few questions to be answered below before I mark this PR as ready for review.

QuanMPhm requested a review from naved001 February 7, 2025 09:58
SU_A100_GPU: {"gpu": 1, "cpu": 24, "ram": 74},
SU_A100_SXM4_GPU: {"gpu": 1, "cpu": 32, "ram": 245},
SU_V100_GPU: {"gpu": 1, "cpu": 24, "ram": 192},
SU_H100_GPU: {"gpu": 1, "cpu": 64, "ram": 384},
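
For context, here is one plausible way a table like this gets consumed when billing a pod (a minimal sketch; su_count is a hypothetical helper, not necessarily how invoice.py does it): the pod is charged for as many SUs as its most demanding resource dimension requires.

import math

# SU shape as proposed above for the H100 (values from this diff).
SU_H100_GPU = {"gpu": 1, "cpu": 64, "ram": 384}

def su_count(shape, gpu, cpu, ram_gib):
    # Most demanding dimension relative to the SU shape, rounded up.
    return math.ceil(max(gpu / shape["gpu"],
                         cpu / shape["cpu"],
                         ram_gib / shape["ram"]))

# A pod asking for 1 GPU but 128 CPUs would count as 2 H100 SUs.
print(su_count(SU_H100_GPU, gpu=1, cpu=128, ram_gib=200))  # -> 2
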
Collaborator

@joachimweyl I just logged into the H100 machine and noticed that it has 256 physical cores, or 512 threads, of which 508 are allocatable. So the SU could give out far more CPUs.

Contributor

@hakasapl is there an easy way to check all of the H100s and confirm that they have 512, not 256, vCPUs?
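
One way to check all nodes at once, rather than logging into each machine (a sketch using the official Python kubernetes client; the label selector is a placeholder, substitute whatever actually identifies the H100 nodes):

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Label selector is an assumption; use whatever matches the H100 nodes.
for node in v1.list_node(label_selector="nvidia.com/gpu.product=H100").items:
    print(node.metadata.name,
          "capacity:", node.status.capacity["cpu"],        # e.g. 512
          "allocatable:", node.status.allocatable["cpu"])  # e.g. 508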

Collaborator

I would be surprised if these machines didn't have the exact same configuration: 256 physical cores, each with 2 hardware threads, so they show up as 512 vCPUs.

Contributor

@naved001 is the RAM correct?


Yes, the config is the same for all of them. The RAM is 1.5 TB.
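
That figure lines up with the ram value initially proposed above (simple arithmetic; assumes the 1.5 TB is 1536 GiB in binary units and is split evenly across the node's 4 GPUs):

ram_gib = 1536                  # 1.5 TiB reported per node (assumed binary units)
gpus_per_node = 4               # four H100s per node, per the scheduling test below
print(ram_gib / gpus_per_node)  # -> 384.0 GiB, the initial H100 SU ram above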

Contributor

Also, can we have all of these pull from one central location? That is why we updated the nerc-doc to pull from one central location.

Collaborator

@joachimweyl I decided to run some tests to see if I could fit 4 giant pods of these theoretical sizes. With 127 CPUs per pod (508/4), I could only fit 3 pods; the 4th pod didn't schedule because the node was left with only 126.016 CPUs.

I then set the CPUs per pod to 126 and the memory to 375 GiB. With that I could successfully launch 4 pods, and the node was completely utilized.

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                       Requests         Limits
  --------                       --------         ------
  cpu                            504874m (99%)    508010m (100%)
  memory                         1539760Mi (99%)  1536500Mi (99%)
  ephemeral-storage              0 (0%)           0 (0%)
  hugepages-1Gi                  0 (0%)           0 (0%)
  hugepages-2Mi                  0 (0%)           0 (0%)
  devices.kubevirt.io/kvm        0                0
  devices.kubevirt.io/tun        0                0
  devices.kubevirt.io/vhost-net  0                0
  nvidia.com/gpu                 4                4

The CPU and memory are 99% utilized.
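
Working backwards from the request totals in that output also shows why the 127-CPU pods could not schedule (plain arithmetic; all values taken from the log above):

allocatable_mcpu = 508010                 # node allocatable CPU (the 100% figure)
pod_requests_mcpu = 4 * 126000            # four pods at 126 CPUs each
system_mcpu = 504874 - pod_requests_mcpu  # 874m requested by infrastructure pods
free_for_pods = allocatable_mcpu - system_mcpu  # 507136m left for our pods
print(free_for_pods / 4)  # -> 126784m per pod: 126 CPUs fit, 127 (127000m) does not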

I think, just to be safe, we should reserve some additional amount beyond what Kubernetes already reserves. I'd say 124 CPUs and 360 GiB memory per pod. That'd leave 12 CPUs and 60 GiB memory for our random little infrastructure things (@jtriley thoughts?)
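
For reference, the headroom that suggestion buys relative to the fully packed 126 CPU / 375 GiB configuration (plain arithmetic):

pods = 4
cpu_headroom = 508 - pods * 124        # 12 CPUs: 8 freed by shrinking, 4 already spare
ram_headroom_gib = pods * (375 - 360)  # 60 GiB freed by shrinking the memory request
print(cpu_headroom, ram_headroom_gib)  # -> 12 60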

I don't know if anybody did this sort of testing before we decided the size for A100.

Contributor

Thank you for doing that testing; that will save us some grief in the future.

Collaborator

Why are these numbers different from the numbers here? For example, the A100 SXM4 says 32 CPUs, but on the pricing page it is 31, and the RAM says 245 vs. 240.

@joachimweyl Because this PR needs to be updated to include the changes from main; the size of the A100 now matches what's in the docs: https://github.com/CCI-MOC/openshift-usage-scripts/blob/main/openshift_metrics/invoice.py#L80

Collaborator Author

I will amend the SU info for the H100s as well. Hopefully all of this will be resolved once #112 is complete.

QuanMPhm marked this pull request as ready for review March 13, 2025 18:27
QuanMPhm (Collaborator Author)

I have rebased the PR and removed all TODOs. The PR is now ready for... review? Undrafting? Submission? Merging? I'm not sure what to call it.

SU_A100_GPU: {"gpu": 1, "cpu": 24, "ram": 74},
SU_A100_SXM4_GPU: {"gpu": 1, "cpu": 31, "ram": 240},
SU_V100_GPU: {"gpu": 1, "cpu": 48, "ram": 192},
SU_H100_GPU: {"gpu": 1, "cpu": 63, "ram": 376},
Collaborator

@QuanMPhm could you update the CPU to 124 and the RAM to 360 here? Ultimately we'll get this from nerc-rates, but I don't want to block this PR on that. And while we are waiting to implement #112, I still want to have some reasonable defaults.
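
With that change, the H100 entry would presumably read:

SU_H100_GPU: {"gpu": 1, "cpu": 124, "ram": 360},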

Collaborator

@joachimweyl Do we have agreement on the final CPU and RAM numbers for the H100 SU?

Contributor

We just need to make sure the nerc-doc shows the correct numbers by the time we have H100s available on prod.

QuanMPhm (Collaborator Author) commented Mar 17, 2025

@naved001 I have amended the SU values for the H100s and fixed the unit test so that it makes the maximum request for an H100 SU.
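
A sketch of what such a test might assert (reusing the hypothetical su_count helper from earlier; the real test suite's helpers may differ): a pod requesting the full SU shape should map to exactly one H100 SU, and anything more should spill into a second.

import math

SU_H100_GPU = {"gpu": 1, "cpu": 124, "ram": 360}

def su_count(shape, gpu, cpu, ram_gib):
    # Hypothetical: SUs consumed = most demanding dimension, rounded up.
    return math.ceil(max(gpu / shape["gpu"],
                         cpu / shape["cpu"],
                         ram_gib / shape["ram"]))

def test_h100_max_request_is_one_su():
    # The maximum request that still fits in a single H100 SU.
    assert su_count(SU_H100_GPU, gpu=1, cpu=124, ram_gib=360) == 1
    # One more CPU tips the pod into a second SU.
    assert su_count(SU_H100_GPU, gpu=1, cpu=125, ram_gib=360) == 2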

QuanMPhm requested a review from naved001 March 17, 2025 14:40
naved001 merged commit f93520f into CCI-MOC:main Mar 17, 2025
2 checks passed
QuanMPhm added a commit to QuanMPhm/nerc-rates that referenced this pull request Apr 1, 2025
Thanks to Naved for doing the testing to realize
these values would be better for the H100 SU [1].
It was found that an H100 could fully utilize up to
124 CPUs.

[1] CCI-MOC/openshift-usage-scripts#98 (comment)
QuanMPhm added a commit to QuanMPhm/nerc-rates that referenced this pull request Apr 2, 2025
jimmysway pushed a commit to jimmysway/nerc-rates that referenced this pull request Sep 15, 2025