feat: Added an update to CoreDNS #501

Open: wants to merge 13 commits into base: develop

Conversation

DannyLiCom
Collaborator

Fixes / Features

Enhance DNS Scalability for Large-Scale Testing:

  • Issue: During large-scale load testing, the existing kube-dns setup could not support the demands of the McJAX and Pathways TPU paths, creating potential performance bottlenecks.
  • Solution: Adjusted the configuration so that the McJAX and Pathways TPU paths default to CoreDNS.

Testing / Documentation

When using the command python3 xpk.py cluster create-pathways to create a cluster, CoreDNS will be used by default. For the general command python3 xpk.py cluster create, kube-dns is still the default.
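
The default-selection behavior described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `default_dns_provider` and the two constants are hypothetical and are not taken from the xpk code base; only the subcommand names and the resulting defaults come from the description.

```python
# Hypothetical sketch: these names are illustrative and are NOT xpk's actual API.
KUBE_DNS = "kube-dns"
COREDNS = "coredns"


def default_dns_provider(subcommand: str) -> str:
    """Return the DNS provider that a `cluster` subcommand defaults to."""
    # Per the PR description, only `create-pathways` switches to CoreDNS.
    if subcommand == "create-pathways":
        return COREDNS
    # The general `create` command keeps kube-dns as its default.
    return KUBE_DNS


if __name__ == "__main__":
    for cmd in ("create", "create-pathways"):
        print(f"cluster {cmd}: {default_dns_provider(cmd)}")
```

A small pure function like this would also be straightforward to cover with a unit test, since it needs no cluster access.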

  • [ y/n ] Tests pass
  • [ y/n ] Appropriate changes to documentation are included in the PR

… create-pathways is used, it will default to CoreDNS.

google-cla bot commented Jun 16, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@ycchenzheng
Collaborator

@SujeethJinesh

Collaborator

@SujeethJinesh SujeethJinesh left a comment

Thanks Danny!

Collaborator

@SujeethJinesh SujeethJinesh left a comment

Thanks Danny!

I also think one more thing we should do is add a unit test for this addition. Though it seems like there are very few set up standards for this. Maybe @pawloch00 or others can chime in on where best to create the unit tests.

https://github.com/AI-Hypercomputer/xpk/tree/develop/src/xpk/core/tests/unit

@pawloch00
Collaborator

Did the testing scenarios include only TPU clusters? What about clusters with GPUs, especially A3 Mega or Ultra, that are created using Cluster Toolkit?

Collaborator

@SujeethJinesh SujeethJinesh left a comment

LGTM from my perspective, but I'm not an XPK owner, so best to get verification from @Obliviour or @pawloch00

@DannyLiCom
Collaborator Author

@pawloch00 please review this PR again

@pawloch00
Collaborator

Please ignore the failing integration tests for now. I will try to fix them and let you know.

@DannyLiCom
Collaborator Author

Okay, Thanks!

@pawloch00
Collaborator

Okay, Thanks!

Please accept the collaborator invitation and rerun the tests.

6 participants