Contributor

Steboss commented Nov 24, 2025

No description provided.

Contributor

yhtang left a comment

Thanks for making this! Made some comments. Let me know what you think.


# Configuration
NAMESPACE="${NAMESPACE:-default}"
JOBSET_NAME="jax-vllm-multinode"
Contributor

We have these variables defined twice in separate files. Is there any way we can provide a single source of truth to avoid unintentional errors (e.g. edited in one place but not the other)?

Contributor Author

Sure thing, Yu-Hang. Consider this file as intended for local development only. I'll change this and put a single, clear definition in the CI.
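
For illustration, one common way to get a single source of truth is a small shared file that every script sources. The following is only a sketch: the file name `config.env` is hypothetical, nothing like it is confirmed to exist in this PR, and making `JOBSET_NAME` overridable is an assumption.

```shell
# Sketch only: in a real repo the shared file would be committed once
# (config.env is a hypothetical name); it is generated here so the
# example runs on its own.
CONFIG_FILE="$(mktemp)"
cat > "${CONFIG_FILE}" <<'EOF'
# Defaults apply only when the caller has not already set the variable.
: "${NAMESPACE:=default}"
: "${JOBSET_NAME:=jax-vllm-multinode}"
export NAMESPACE JOBSET_NAME
EOF

# Every script that needs these values sources the same file.
. "${CONFIG_FILE}"
echo "namespace=${NAMESPACE} jobset=${JOBSET_NAME}"
```

With this layout, a rename happens in one place, and CI can still override a value by exporting it before the file is sourced.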

effect: NoSchedule
containers:
- name: gateway-container
  image: 941377147396.dkr.ecr.us-east-1.amazonaws.com/sbosisio/jio:jax-k8s
Contributor

Just a note that we need to turn this image reference into a placeholder and set it dynamically for each job in the final production workflow.
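
A minimal sketch of what such a placeholder could look like, assuming the manifest is rendered with `sed` before it is applied. The token `__GATEWAY_IMAGE__` and the function name `render_manifest` are hypothetical, chosen only for this example.

```shell
# render_manifest IMAGE
# Reads a manifest on stdin and replaces the hypothetical placeholder
# token __GATEWAY_IMAGE__ with the given image reference.
# "|" is used as the sed delimiter because image references contain "/" and ":".
render_manifest() {
  sed "s|__GATEWAY_IMAGE__|$1|g"
}

# Example: render a one-line fragment with a made-up image reference.
printf 'image: __GATEWAY_IMAGE__\n' | render_manifest "registry.example.com/team/app:tag"
```

In a workflow, the rendered output would typically be piped to `kubectl apply -f -`, with the image reference supplied per job by the CI system.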

echo "Gateway URL: ${GATEWAY_URL}"
echo "Ray Head IP: ${RAY_HEAD_IP}"

# 1. Wait for gateway to be ready
Contributor

Noting that in the long run this could be integrated into the bridge.

@Steboss how long could the gap be between the start of the gateway pod and the start of the application (jax/vLLM) pods? Can we make the launch of the application pods dependent on the gateway pod?
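
For reference, a script-side wait is usually written as a bounded polling loop so that a gateway that never comes up fails the job instead of hanging it. The sketch below is not the code from this PR; the helper name `wait_until` and the `/health` path in the usage comment are assumptions.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds, or returns 1 once TIMEOUT_SECONDS
# have elapsed. Written for POSIX sh, so no "local" or bash-only syntax.
wait_until() {
  _wu_timeout="$1"
  _wu_interval="$2"
  shift 2
  _wu_deadline=$(( $(date +%s) + _wu_timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "${_wu_deadline}" ]; then
      echo "timed out after ${_wu_timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep "${_wu_interval}"
  done
}

# Hypothetical usage (the /health endpoint is an assumption):
# wait_until 300 5 curl --silent --fail --output /dev/null "${GATEWAY_URL}/health"
```

The explicit timeout also gives an upper bound on the gap being asked about here, which makes it easier to reason about in CI logs.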

#NCCL
# - name: NCCL_DEBUG
# value: "INFO" # Change to WARN after debugging
- name: NCCL_PROTO
Contributor

NCCL's official guidelines are to avoid setting this variable explicitly whenever possible. Is this mandated by AWS?

Contributor

yhtang left a comment

Now that we have a working example, could we integrate this into jio.yaml?

Also, could we add a performance-monitoring step to this job so that if throughput drops below a certain baseline, the job reports a failure?
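
As a rough sketch of what such a gate could look like, assuming the job can extract a single measured throughput number (for example in tokens per second) and compare it against a configured baseline. The function name `check_throughput` and both variables in the usage comment are hypothetical.

```shell
# check_throughput MEASURED BASELINE
# Succeeds when MEASURED >= BASELINE and fails otherwise.
# awk does the comparison because shell arithmetic is integer-only.
check_throughput() {
  if awk -v m="$1" -v b="$2" 'BEGIN { exit !(m + 0 >= b + 0) }'; then
    echo "throughput OK: $1 >= $2"
  else
    echo "throughput regression: $1 < $2" >&2
    return 1
  fi
}

# Hypothetical usage as the last step of the job:
# check_throughput "${MEASURED_THROUGHPUT}" "${BASELINE_THROUGHPUT}" || exit 1
```

A non-zero exit from the final step is enough for most CI systems to mark the job as failed; how the measured value is obtained is left open here.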

Contributor Author

Steboss commented Dec 2, 2025

hey @yhtang

> Also, could we add a performance-monitoring step to this job so that if throughput drops below a certain baseline, the job reports a failure?

We can definitely put together the fully working example. One caveat: we're still investigating why NCCL doesn't pick up EFA on EKS. If it's OK with you, we can start with this approach; performance will be low for now, and I'll give you some numbers by EOW at the latest.

3 participants