This commit will:
* Update the collectd and stress-ng images (via @dorney99)
* Add a 30 second delay using an init container
* Update autoscaler to use new version of API
* Update security settings for container and resource limits
* Trivy fix for hostPort and doc updates
* Update docs with latest changes
---------
Co-authored-by: Denisio Togashi <[email protected]>
Co-authored-by: Aaron Dorney <[email protected]>
Co-authored-by: Madalina Lazar <[email protected]>
Using the above commands, both of these images will be built and pushed to the local Docker registry, ready for deployment on Kubernetes.
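For reference, a minimal sketch of what such build-and-push steps typically look like; the image names, tags, build contexts and registry address below are assumptions, so use the exact commands given for your setup:

```bash
# Sketch only: image names, tags, build contexts and the registry address
# (localhost:5000) are assumptions; substitute the values used in your setup.
docker build -t localhost:5000/collectd:latest ./collectd
docker push localhost:5000/collectd:latest

docker build -t localhost:5000/stress-ng:latest ./stress-ng
docker push localhost:5000/stress-ng:latest
```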
### 1: Deploy collectd on Kubernetes
Collectd can now be deployed on our Kubernetes cluster. Note that the collectd image is configured through a configmap, located in `collectd/configmap.yaml`, and can be reconfigured by editing that file. In our default configuration collectd will only export stats on the node CPU package power. This is powered by the [Intel Comms Power Management collectd plugin](https://github.com/intel/CommsPowerManagement/tree/master/telemetry). The collectd [intel/observability-collectd image](https://hub.docker.com/r/intel/observability-collectd) needs an extra Python script for reading and exporting power values; the script can be fetched with a curl command and must be added as a configmap.
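The exact script and commands are not reproduced in this extract; a minimal sketch, assuming the `pkgpower.py` script from the CommsPowerManagement telemetry directory, the `monitoring` namespace used later in this guide, and a `collectd/` manifest directory, could look like:

```bash
# Assumption: pkgpower.py is the power-reading script shipped in the
# CommsPowerManagement telemetry directory.
curl -LO https://raw.githubusercontent.com/intel/CommsPowerManagement/master/telemetry/pkgpower.py

# Expose the script to the collectd pods as a configmap
# (configmap name and namespace are examples).
kubectl create configmap pkgpower-script -n monitoring --from-file=pkgpower.py

# Deploy collectd itself; this assumes the collectd/ directory holds the
# configmap.yaml mentioned above plus the daemonset manifest.
kubectl apply -f collectd/
```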
The agent should soon be up and running on each node in the cluster. This can be checked by running:
`kubectl get pods -nmonitoring -lapp.kubernetes.io/name=collectd`
Additionally we can check that collectd is indeed serving metrics by running `collectd_ip=$(kubectl get svc -n monitoring -o json | jq -r '.items[] | select(.metadata.name=="collectd").spec.clusterIP'); curl -x "" $collectd_ip:9103/metrics`. The output should look similar to the lines below:
```
# collectd/write_prometheus 5.11.0.70.gd4c3c59 at localhost
```
### 2: Install Kube State Metrics
Kube State Metrics reads information from a Kubernetes cluster and makes it available to Prometheus. In our use case we're looking for basic information about the pods currently running on the cluster. Kube State Metrics can be installed using Helm, the Kubernetes package manager, which comes preinstalled in the BMRA.
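One way to do this, assuming the community chart repository is reachable and reusing the `monitoring` namespace from earlier (the release name and namespace are examples, not the exact commands from this guide):

```bash
# Add the community chart repository and install kube-state-metrics.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-state-metrics prometheus-community/kube-state-metrics --namespace monitoring
```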
Once kube-state-metrics is deployed, check that its pod is up and running:

```
kubectl get pods -l app.kubernetes.io/name=kube-state-metrics
```
### 3: Configure Prometheus and the Prometheus adapter
Now that we have our metrics collection agents up and running, Prometheus and the Prometheus Adapter (which makes Prometheus metrics available inside the cluster to Telemetry Aware Scheduling) need to be configured to scrape metrics from collectd and Kube State Metrics, and to calculate the compound power metrics linking pods to power usage.
#### 3.1: Manually set up and install the new configuration
The above command creates a new configuration and restarts Prometheus and the Prometheus Adapter. Collectd power metrics should now be available in the Prometheus database and, more importantly, inside the Kubernetes cluster itself.
#### 3.2: Helm and updating an existing configuration
#### 3.2.1: Prometheus
To add new metrics into Prometheus, update [prometheus-config-map.yaml](https://github.com/intel/platform-aware-scheduling/blob/master/telemetry-aware-scheduling/deploy/charts/prometheus_helm_chart/templates/prometheus-config-map.yaml) in a similar manner to the below:
1. Create a metric and add it to a Prometheus rule file
```
- record: node_package_power_per_pod
  expr: kube_pod_info * on (node) group_left node_package_power_avg
```
2. Add the rule file to Prometheus (if adding a completely new rule file). If the rule file has already been added (e.g. ***/etc/prometheus/prometheus.rules***), move to step 3
```
rule_files:
  - /etc/prometheus/prometheus.rules
  - recording_rules.yml
```
3. Inform Prometheus to scrape kube-state-metrics and collectd, and how to do it
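The scrape configuration from the repository's config map is not reproduced in this extract; as a rough sketch of the idea (the real config map most likely uses Kubernetes service discovery, and the service names, namespace and kube-state-metrics port below are assumptions, while port 9103 is collectd's port used earlier):

```yaml
scrape_configs:
  # Scrape kube-state-metrics through its cluster service
  # (service name, namespace and port are assumptions).
  - job_name: kube-state-metrics
    static_configs:
      - targets: ['kube-state-metrics.monitoring.svc:8080']
  # Scrape the collectd write_prometheus endpoint
  # (port 9103, as used when curling the collectd service earlier).
  - job_name: collectd
    static_configs:
      - targets: ['collectd.monitoring.svc:9103']
```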
To check if the new metrics were scraped correctly you can look for the following in the Prometheus UI:
* In the "Targets"/ "Service Discovery" sections the "kube-state-metrics" and "collectd" items should be present and should be "UP"
* In the main search page, look for "node_package_power_avg" and "node_package_power_per_pod". Each of these queries should return a non-empty answer
4. Upgrade the Helm chart and restart the pods
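The exact command is not shown in this extract; a sketch, assuming a release named `prometheus` and the chart path from this repository, might be:

```bash
# Release name, namespace, chart path and pod label are assumptions;
# substitute the values from your installation.
helm upgrade prometheus ./charts/prometheus_helm_chart --namespace monitoring

# Restart the Prometheus pods so the new configuration is picked up.
kubectl delete pods -n monitoring -l app=prometheus
```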
***The name of the helm chart and the path to charts/prometheus_helm_chart might differ depending on your installation of Prometheus (what name you gave to the chart) and your current path.***
#### 3.2.2: Prometheus Adapter
1. Query & process the metric before exposing it
The newly created metrics now need to be exported from Prometheus to the Telemetry Aware Scheduler, and to do so we need to add two new rules in [custom-metrics-config-map.yaml](https://github.com/intel/platform-aware-scheduling/blob/master/telemetry-aware-scheduling/deploy/charts/prometheus_custom_metrics_helm_chart/templates/custom-metrics-config-map.yaml).
The config currently present in the [repo](https://github.com/intel/platform-aware-scheduling/blob/master/telemetry-aware-scheduling/deploy/charts/prometheus_custom_metrics_helm_chart/templates/custom-metrics-config-map.yaml) will expose these metrics by default, but if the metrics are not showing (see commands below) you can try adding the following (the first item is responsible for fetching the "node_package_power_avg" metric, whereas the second is for "node_package_power_per_pod"):
```
- seriesQuery: '{__name__=~"^node_.*"}'
  resources:
    overrides:
      instance:
        resource: node
  name:
    matches: ^node_(.*)
  metricsQuery: <<.Series>>
- seriesQuery: '{__name__=~"^node_.*",pod!=""}'
  resources:
    template: <<.Resource>>
  metricsQuery: <<.Series>>
```
***Details about the rules and the schema are available [here](https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/docs/config.md)***
***The name of the helm chart and the path to charts/prometheus_custom_metrics_helm_chart/ might differ depending on your installation of Prometheus (what name you gave to the chart) and your current path.***
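The corresponding upgrade step is not reproduced in this extract; a sketch, assuming a release named `prometheus-adapter` and the chart path above, might be:

```bash
# Release name, namespace, chart path and pod label are assumptions;
# substitute the values from your installation.
helm upgrade prometheus-adapter ./charts/prometheus_custom_metrics_helm_chart --namespace custom-metrics

# Restart the adapter pod so the new rules are loaded.
kubectl delete pods -n custom-metrics -l app=custom-metrics-apiserver
```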
After running the above commands (section 3.1 or 3.2), collectd power metrics should now be available in the Prometheus database and, more importantly, inside the Kubernetes cluster itself. In order to confirm the changes, we can run a raw metrics query against the Kubernetes API.
``kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/nodes/*/package_power_avg"``
``kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/node_package_power_per_pod"``
This command should return JSON objects containing pods and their associated power metrics:
```json
"kind": "MetricValueList",
"apiVersion": "custom.metrics.k8s.io/v1beta1",