This repository was archived by the owner on Jan 29, 2025. It is now read-only.
Documentation for the new features.
This also removes the configure-scheduler script duplication and simply suggests using the instructions from TAS instead. The script there has been updated for k8s 1.22+.
Signed-off-by: Ukri Niemimuukko <[email protected]>
gpu-aware-scheduling/README.md (+3 −55)
@@ -32,61 +32,9 @@ A worked example for GAS is available [here](docs/usage.md)
The deploy folder has all of the yaml files necessary to get GPU Aware Scheduling running in a Kubernetes cluster. Some additional steps are required to configure the generic scheduler.
#### Extender configuration
-Note: a shell script that shows these steps can be found [here](deploy/extender-configuration). This script should be seen as a guide only, and will not work on most Kubernetes installations.
-
-The extender configuration files can be found under deploy/extender-configuration.
-
-GAS Scheduler Extender needs to be registered with the Kubernetes Scheduler. In order to do this a configmap should be created like the below:
-A similar file can be found [in the deploy folder](./deploy/extender-configuration/scheduler-extender-configmap.yaml). This configmap can be created with ``kubectl apply -f ./deploy/scheduler-extender-configmap.yaml``
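For orientation, the configmap referenced above follows the legacy kube-scheduler Policy format. The sketch below is illustrative only; the service URL, port and managed resource name are assumptions, and the authoritative content is the file in deploy/extender-configuration/scheduler-extender-configmap.yaml.

```bash
# Sketch of the legacy policy configmap (the service URL, port and resource
# name are assumptions). Prefer the file shipped in deploy/extender-configuration.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: scheduler-extender-policy
  namespace: kube-system
data:
  policy.cfg: |
    {
      "kind": "Policy",
      "apiVersion": "v1",
      "extenders": [
        {
          "urlPrefix": "https://gas-service.default.svc.cluster.local:9001",
          "filterVerb": "scheduler/filter",
          "prioritizeVerb": "scheduler/prioritize",
          "weight": 1,
          "enableHttps": true,
          "managedResources": [
            { "name": "gpu.intel.com/i915", "ignoredByScheduler": false }
          ],
          "ignorable": true
        }
      ]
    }
EOF
```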
-
-The scheduler requires flags passed to it in order to know the location of this config map. The flags are:
-
-```
-- --policy-configmap=scheduler-extender-policy
-- --policy-configmap-namespace=kube-system
-```
-
-If the scheduler is running as a service these can be added as flags to the binary. If the scheduler is running as a container - as in kubeadm - these args can be passed in the deployment file.
-
-Note: For kubeadm setups some additional steps may be needed.
-
-1) Add the ability to get configmaps to the kubeadm scheduler config map. (A cluster role binding for this is at deploy/extender-configuration/configmap-getter.yaml)
-2) Add the ``dnsPolicy: ClusterFirstWithHostNet`` in order to access the scheduler extender by service name.
-
-After these steps the scheduler extender should be registered with the Kubernetes Scheduler.
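The kubeadm case above amounts to editing the scheduler's static pod manifest. A rough sketch, assuming the usual kubeadm manifest path:

```bash
# Sketch only: the manifest path and surrounding fields are assumptions.
# Add the policy-configmap flags and the dnsPolicy change to the scheduler pod:
sudo vi /etc/kubernetes/manifests/kube-scheduler.yaml
#   spec:
#     dnsPolicy: ClusterFirstWithHostNet
#     containers:
#     - command:
#       - kube-scheduler
#       - --policy-configmap=scheduler-extender-policy
#       - --policy-configmap-namespace=kube-system
# Allow the scheduler to read configmaps (cluster role binding from this repo):
kubectl apply -f deploy/extender-configuration/configmap-getter.yaml
```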
+You should follow the extender configuration instructions from [Telemetry Aware Scheduling](../telemetry-aware-scheduling/README.md#Extender-configuration) and adapt them to use the GPU Aware Scheduling configurations, which can be found in the [deploy/extender-configuration](deploy/extender-configuration) folder.
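On Kubernetes 1.22 and newer, the TAS instructions register the extender through a KubeSchedulerConfiguration file rather than a policy configmap. The following is a minimal sketch under that assumption; the file path, service URL and resource name are placeholders, and the TAS README plus the files under deploy/extender-configuration are authoritative.

```bash
# Sketch only: write a KubeSchedulerConfiguration that declares the GAS extender
# (path, URL and resource name are illustrative assumptions), then start
# kube-scheduler with --config pointing at it.
cat <<'EOF' | sudo tee /etc/kubernetes/scheduler-config.yaml
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf
extenders:
  - urlPrefix: "https://gas-service.default.svc.cluster.local:9001"
    filterVerb: "scheduler/filter"
    prioritizeVerb: "scheduler/prioritize"
    weight: 1
    enableHTTPS: true
    managedResources:
      - name: "gpu.intel.com/i915"
        ignoredByScheduler: false
    ignorable: true
EOF
# kube-scheduler is then started with --config=/etc/kubernetes/scheduler-config.yaml
```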
#### Deploy GAS
GPU Aware Scheduling uses go modules. It requires Go 1.13+ with modules enabled in order to build. GAS has been tested with Kubernetes 1.15+.
+which puts them in its own namespace. In practice the supported labels need to be in the `telemetry.aware.scheduling.POLICYNAME/` namespace, where the POLICYNAME may be anything.
+
+The node label `gas-deschedule-pods-GPUNAME`, where the GPUNAME can be e.g. card0, card1, card2... and corresponds to the gpu names under /dev/dri, will result in GAS labeling the PODs which use the named GPU with the `gpu.aware.scheduling/deschedule-pod=gpu` label. You may then use a Kubernetes descheduler to pick those pods for descheduling. So TAS labels the node, and based on the node label GAS finds and labels the PODs. The descheduler can then be configured to deschedule the pods based on pod labels.
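A minimal sketch of that flow (the node name is hypothetical):

```bash
# Request descheduling of pods that use card0 on node "worker-1"
# (in practice TAS sets this label based on a telemetry policy).
kubectl label node worker-1 gas-deschedule-pods-card0=true
# GAS then labels the affected pods; a descheduler can select them by this label.
kubectl get pods --all-namespaces -l gpu.aware.scheduling/deschedule-pod=gpu
```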
+
+The node label `gas-disable-GPUNAME`, where the GPUNAME can be e.g. card0, card1, card2... and corresponds to the gpu names under /dev/dri, will result in GAS stopping the use of the named GPU for new allocations.
+
+The node label `gas-prefer-gpu=GPUNAME`, where the GPUNAME can be e.g. card0, card1, card2... and corresponds to the gpu names under /dev/dri, will result in GAS trying to use the named GPU for new allocations before other GPUs of the same node.
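For example (node and card names are hypothetical; the gas-disable value is arbitrary):

```bash
# Stop using card1 of node "worker-1" for new allocations.
kubectl label node worker-1 gas-disable-card1=true
# Prefer card2 of the same node for new allocations.
kubectl label node worker-1 gas-prefer-gpu=card2
```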
+
+Note that the value of the labels starting with gas-deschedule-pods-GPUNAME and gas-disable-GPUNAME doesn't matter. You may use e.g. "true" as the value. The only exception to the rule is `PCI_GROUP`, which has a special meaning, explained separately. Example: `gas-disable-card0=PCI_GROUP`.
+
+### PCI Groups
+
+If GAS finds a node label `gas-disable-GPUNAME=PCI_GROUP`, where the GPUNAME can be e.g. card0, card1, card2... and corresponds to the gpu names under /dev/dri, the disabling will impact a group of GPUs which is defined in the node label `gpu.intel.com/pci-groups`. The syntax of the pci group node label is easiest to explain with an example: `gpu.intel.com/pci-groups=0.1_2.3.4` would indicate there are two pci-groups in the node, separated with an underscore, in which card0 and card1 form the first group, and card2, card3 and card4 form the second group. If GAS found the node label `gas-disable-card3=PCI_GROUP` on a node with the previous example PCI-group label, it would stop using card2, card3 and card4 for new allocations, as card3 belongs to that group.
+
+`gas-deschedule-pods-GPUNAME` supports the PCI_GROUP value similarly: the whole group in which the named gpu belongs will end up descheduled.
+
+The PCI group feature allows for e.g. having a telemetry action to operate on all GPUs which share the same physical card.
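A sketch of the example above, with a hypothetical node name:

```bash
# The node advertises two PCI groups: {card0, card1} and {card2, card3, card4}.
kubectl label node worker-1 gpu.intel.com/pci-groups=0.1_2.3.4
# Disabling card3 with the PCI_GROUP value then disables the whole second group
# (card2, card3 and card4) for new allocations.
kubectl label node worker-1 gas-disable-card3=PCI_GROUP
```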
## Allowlist and Denylist
You can use POD-annotations in your POD-templates to list the GPU names which you allow or deny for your deployment. The values for the annotations are comma-separated value lists of the form "card0,card1,card2", and the names of the annotations are: