update gang-scheduling benchmark #92

Merged 1 commit on Aug 10, 2024
35 changes: 35 additions & 0 deletions resources/benchmarks/gang-scheduling/README.md
@@ -0,0 +1,35 @@
# Gang Scheduling Benchmark Test

This directory contains gang scheduling benchmark tests for the following workload managers and schedulers:

- Jobset
- Kueue
- Volcano
- Yunikorn
- Run:ai

The gang-scheduling benchmark workflow runs on 32 virtual GPU nodes and submits a burst of 53 jobs, with replica counts ranging from 1 to 32, in a [predetermined order](workflows/run-test-common.yml).

The workload is designed to fully utilize the cluster under optimal scheduling conditions.

One way to benchmark is to submit this workload to clusters that use different schedulers and then compare the average GPU occupancy of the nodes.
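
The job count and the "fully utilize" claim can be checked with simple arithmetic on the submission order. The sketch below copies the `count` and `replicas` values from `workflows/run-test-common.yml` and groups tasks by the prefix of their id (so `job3` and `job3.1` form one group). The grouping, and the assumption that one replica occupies one node, are interpretations made for illustration; the workflow itself does not state them.

```python
from collections import defaultdict

# (task id, count, replicas) copied from workflows/run-test-common.yml
SUBMISSIONS = [
    ("job1", 1, 32), ("job2", 2, 16), ("job3", 3, 10), ("job3.1", 1, 2),
    ("job4", 4, 8), ("job5", 5, 6), ("job5.1", 2, 1), ("job6", 6, 5),
    ("job6.1", 1, 2), ("job7", 7, 4), ("job7.1", 1, 2), ("job7.2", 2, 1),
    ("job8", 8, 4), ("job9", 9, 3), ("job9.1", 1, 5),
]
NODES = 32  # node count configured by the workflow


def summarize(submissions):
    """Return the total job count and the replica total per task-id group."""
    total_jobs = sum(count for _, count, _ in submissions)
    per_group = defaultdict(int)
    for task_id, count, replicas in submissions:
        group = task_id.split(".")[0]  # "job3.1" -> "job3" (assumed grouping)
        per_group[group] += count * replicas
    return total_jobs, dict(per_group)


total_jobs, per_group = summarize(SUBMISSIONS)
print(total_jobs)  # 53
print(all(total == NODES for total in per_group.values()))  # True
```

Under these assumptions, each of the nine groups requests exactly 32 replicas, which is what lets an ideal scheduler keep every node occupied.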

## Usage

For all workload managers except Run:ai, the benchmark test involves two sequential workflows. The first workflow registers the CRDs, and the second workflow runs the common part of the test.

### Example

To run the benchmark test for Kueue:

```bash
./bin/knavigator -workflow resources/benchmarks/gang-scheduling/workflows/config-kueue.yml,resources/benchmarks/gang-scheduling/workflows/run-test-common.yml
```
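
The two-workflow invocation can be generated for other schedulers with a small helper. This is a minimal sketch: the function name `knavigator_args` is hypothetical, and it assumes each scheduler's configuration file follows the `config-<scheduler>.yml` naming seen for Kueue. Only the `-workflow` flag and the comma-separated path list are taken from the example above.

```python
WORKFLOW_DIR = "resources/benchmarks/gang-scheduling/workflows"


def knavigator_args(scheduler, binary="./bin/knavigator"):
    """Build the argument list for a two-stage benchmark run.

    Assumes the config workflow is named config-<scheduler>.yml.
    """
    workflows = [
        f"{WORKFLOW_DIR}/config-{scheduler}.yml",  # scheduler-specific setup
        f"{WORKFLOW_DIR}/run-test-common.yml",     # common part of the test
    ]
    return [binary, "-workflow", ",".join(workflows)]


print(" ".join(knavigator_args("kueue")))
```

For `kueue`, the printed line reproduces the command shown above. The returned list can be passed to `subprocess.run` as-is.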

### Run:ai

Run:ai requires additional customization and thus has a separate workflow:

```bash
./bin/knavigator -workflow resources/benchmarks/gang-scheduling/workflows/run-test-runai.yml
```
@@ -0,0 +1,9 @@
name: config-jobset
tasks:
- id: register
  type: RegisterObj
  params:
    template: "resources/benchmarks/templates/k8s/jobset.yml"
    nameFormat: "jobset{{._ENUM_}}"
    podNameFormat: "{{._NAME_}}-workers-[0-9]+-[0-9]+-.+"
    podCount: "{{.replicas}}"
50 changes: 50 additions & 0 deletions resources/benchmarks/gang-scheduling/workflows/config-kueue.yml
@@ -0,0 +1,50 @@
name: config-kueue
tasks:
- id: register-cluster-queue
  type: RegisterObj
  params:
    template: "resources/templates/kueue/cluster-queue.yml"
- id: register-local-queue
  type: RegisterObj
  params:
    template: "resources/templates/kueue/local-queue.yml"
- id: register-resource-flavor
  type: RegisterObj
  params:
    template: "resources/templates/kueue/resource-flavor.yml"
- id: register
  type: RegisterObj
  params:
    template: "resources/benchmarks/templates/kueue/job.yml"
    nameFormat: "job{{._ENUM_}}"
    podNameFormat: "{{._NAME_}}-[0-9]-.*"
    podCount: "{{.replicas}}"
- id: create-resource-flavor
  type: SubmitObj
  params:
    refTaskId: register-resource-flavor
    canExist: true
    params:
      name: "gpu-node"
      nodeLabels:
        nvidia.com/gpu.count: "8"
- id: create-cluster-queue
  type: SubmitObj
  params:
    refTaskId: register-cluster-queue
    canExist: true
    params:
      name: team
      flavor: gpu-node
      cpu: 8
      memory: 36Gi
      gpu: 256
- id: create-local-queue
  type: SubmitObj
  params:
    refTaskId: register-local-queue
    canExist: true
    params:
      name: team-queue
      namespace: default
      clusterQueue: team
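
The `gpu: 256` quota in `create-cluster-queue` appears to correspond to the full GPU capacity of the simulated cluster. A quick check, assuming the `nvidia.com/gpu.count` label value reflects the number of GPUs on each node (the constants below are copied from the workflows in this PR; the function name is illustrative):

```python
NODE_COUNT = 32                # `count` of dgxa100.80g nodes in run-test-common.yml
GPUS_PER_NODE = 8              # value of the nvidia.com/gpu.count node label
CLUSTER_QUEUE_GPU_QUOTA = 256  # `gpu` parameter of create-cluster-queue


def total_gpus(nodes, gpus_per_node):
    """Total GPU capacity, assuming every node exposes the same GPU count."""
    return nodes * gpus_per_node


print(total_gpus(NODE_COUNT, GPUS_PER_NODE) == CLUSTER_QUEUE_GPU_QUOTA)  # True
```

The same figure is used as the `nvidia.com/gpu` maximum for the `sandbox` queue in the Yunikorn configuration.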
13 changes: 0 additions & 13 deletions resources/benchmarks/gang-scheduling/workflows/config-nodes.yml

This file was deleted.

@@ -0,0 +1,31 @@
name: config-volcano
tasks:
- id: register
  type: RegisterObj
  params:
    template: "resources/benchmarks/templates/volcano/job.yml"
    nameFormat: "j{{._ENUM_}}"
    podNameFormat: "{{._NAME_}}-test-[0-9]+"
    podCount: "{{.replicas}}"
- id: configure
  type: Configure
  params:
    configmaps:
    - name: volcano-scheduler-configmap
      namespace: volcano-system
      op: create
      data:
        volcano-scheduler.conf: |
          actions: "enqueue, allocate, backfill"
          tiers:
          - plugins:
            - name: priority
            - name: gang
            - name: conformance
          - plugins:
            - name: drf
            - name: predicates
            - name: proportion
            - name: nodeorder
            - name: binpack
    timeout: 1m
@@ -0,0 +1,29 @@
name: config-yunikorn
tasks:
- id: register
  type: RegisterObj
  params:
    template: "resources/benchmarks/templates/yunikorn/job.yml"
    nameFormat: "job{{._ENUM_}}"
    podNameFormat: "{{._NAME_}}-.*"
    podCount: "{{.replicas}}"
- id: configure
  type: Configure
  params:
    configmaps:
    - name: yunikorn-configs
      namespace: yunikorn
      op: create
      data:
        queues.yaml: |
          partitions:
          - name: default
            queues:
            - name: root
              queues:
              - name: sandbox
                submitacl: '*'
                resources:
                  max:
                    {memory: 36Gi, vcore: 8000m, nvidia.com/gpu: 256}
    timeout: 1m
135 changes: 135 additions & 0 deletions resources/benchmarks/gang-scheduling/workflows/run-test-common.yml
@@ -0,0 +1,135 @@
name: test-gang-scheduling
tasks:
- id: configure
  type: Configure
  params:
    nodes:
    - type: dgxa100.80g
      count: 32
      labels:
        nvidia.com/gpu.count: "8"
    timeout: 1m
- id: sleep
  type: Sleep
  params:
    timeout: 5s
- id: job1
  type: SubmitObj
  params:
    refTaskId: register
    count: 1
    params:
      replicas: 32
      ttl: 30s
- id: job2
  type: SubmitObj
  params:
    refTaskId: register
    count: 2
    params:
      replicas: 16
      ttl: 30s
- id: job3
  type: SubmitObj
  params:
    refTaskId: register
    count: 3
    params:
      replicas: 10
      ttl: 30s
- id: job3.1
  type: SubmitObj
  params:
    refTaskId: register
    count: 1
    params:
      replicas: 2
      ttl: 30s
- id: job4
  type: SubmitObj
  params:
    refTaskId: register
    count: 4
    params:
      replicas: 8
      ttl: 30s
- id: job5
  type: SubmitObj
  params:
    refTaskId: register
    count: 5
    params:
      replicas: 6
      ttl: 30s
- id: job5.1
  type: SubmitObj
  params:
    refTaskId: register
    count: 2
    params:
      replicas: 1
      ttl: 30s
- id: job6
  type: SubmitObj
  params:
    refTaskId: register
    count: 6
    params:
      replicas: 5
      ttl: 30s
- id: job6.1
  type: SubmitObj
  params:
    refTaskId: register
    count: 1
    params:
      replicas: 2
      ttl: 30s
- id: job7
  type: SubmitObj
  params:
    refTaskId: register
    count: 7
    params:
      replicas: 4
      ttl: 30s
- id: job7.1
  type: SubmitObj
  params:
    refTaskId: register
    count: 1
    params:
      replicas: 2
      ttl: 30s
- id: job7.2
  type: SubmitObj
  params:
    refTaskId: register
    count: 2
    params:
      replicas: 1
      ttl: 30s
- id: job8
  type: SubmitObj
  params:
    refTaskId: register
    count: 8
    params:
      replicas: 4
      ttl: 30s
- id: job9
  type: SubmitObj
  params:
    refTaskId: register
    count: 9
    params:
      replicas: 3
      ttl: 30s
- id: job9.1
  type: SubmitObj
  params:
    refTaskId: register
    count: 1
    params:
      replicas: 5
      ttl: 30s
@@ -1,5 +1,14 @@
-name: test-gang-scheduling
+name: test-gang-scheduling-runai
 tasks:
+- id: configure
+  type: Configure
+  params:
+    nodes:
+    - type: dgxa100.80g
+      count: 32
+      labels:
+        nvidia.com/gpu.count: "8"
+    timeout: 1m
 - id: register-trainingworkload
   type: RegisterObj
   params: