Commit cd40f01
feat: kube-scheduler local dev (#1116)
1 parent: 543b2f1

4 files changed: +120 −36 lines

Makefile

Lines changed: 5 additions & 4 deletions

@@ -40,22 +40,23 @@ dev: generate
 .PHONY: dev-port-forward
 dev-port-forward:
-	kubectl --context k3d-kubernetes-mixin port-forward service/lgtm 3000:3000 4317:4317 4318:4318 9090:9090
+	kubectl --context kind-kubernetes-mixin wait --for=condition=Ready pods -l app=lgtm --timeout=300s
+	kubectl --context kind-kubernetes-mixin port-forward service/lgtm 3000:3000 4317:4317 4318:4318 9090:9090
 
 dev-reload: generate
 	@cp -v prometheus_alerts.yaml scripts/provisioning/prometheus/ && \
 	cp -v prometheus_rules.yaml scripts/provisioning/prometheus/ && \
-	kubectl --context k3d-kubernetes-mixin rollout restart deployment/lgtm && \
+	kubectl --context kind-kubernetes-mixin rollout restart deployment/lgtm && \
 	echo '╔═══════════════════════════════════════════════════════════════╗' && \
 	echo '║                                                               ║' && \
 	echo '║        🔄 Reloading Alert and Recording Rules...              ║' && \
 	echo '║                                                               ║' && \
 	echo '╚═══════════════════════════════════════════════════════════════╝' && \
-	kubectl --context k3d-kubernetes-mixin rollout status deployment/lgtm
+	kubectl --context kind-kubernetes-mixin rollout status deployment/lgtm
 
 .PHONY: dev-down
 dev-down:
-	k3d cluster delete kubernetes-mixin
+	kind delete cluster --name kubernetes-mixin
 
 .PHONY: generate
 generate: prometheus_alerts.yaml prometheus_rules.yaml $(OUT_DIR)
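The new `dev-port-forward` target gates the port-forward on pod readiness, so the forward no longer races the `lgtm` deployment coming up. The same guard pattern can be sketched as a small retry helper (hypothetical, not part of the Makefile):

```shell
#!/bin/sh
# Sketch: retry a command until it succeeds, mirroring the
# wait-before-forward ordering of the dev-port-forward target.
retry() {
  attempts=$1
  shift
  i=0
  until "$@"; do
    i=$((i + 1))
    [ "$i" -ge "$attempts" ] && return 1
    sleep 1
  done
}

# Usage against a real cluster (context name taken from the Makefile):
#   retry 30 kubectl --context kind-kubernetes-mixin get pods -l app=lgtm
```

`kubectl wait --for=condition=Ready` does the same job in one call; the helper only makes the retry-until-ready ordering explicit.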

README.md

Lines changed: 46 additions & 13 deletions

@@ -2,10 +2,41 @@
 [![ci](https://github.com/kubernetes-monitoring/kubernetes-mixin/actions/workflows/ci.yaml/badge.svg)](https://github.com/kubernetes-monitoring/kubernetes-mixin/actions/workflows/ci.yaml)
 
-> NOTE: This project is *pre-release* stage. Flags, configuration, behaviour and design may change significantly in following releases.
-
 A set of Grafana dashboards and Prometheus alerts for Kubernetes.
 
+## Local development
+
+Run the following command to set up a local [kind](https://kind.sigs.k8s.io) cluster:
+
+```shell
+make dev
+```
+
+You should see the following output if successful:
+
+```shell
+╔═══════════════════════════════════════════════════════════════╗
+║           🚀 Development Environment Ready! 🚀                ║
+║                                                               ║
+║  Run `make dev-port-forward`                                  ║
+║  Grafana will be available at http://localhost:3000           ║
+║                                                               ║
+║  Data will be available in a few minutes.                     ║
+║                                                               ║
+║  Dashboards will refresh every 10s, run `make generate`       ║
+║  and refresh your browser to see the changes.                 ║
+║                                                               ║
+║  Alert and recording rules require `make dev-reload`.         ║
+║                                                               ║
+╚═══════════════════════════════════════════════════════════════╝
+```
+
+To delete the cluster, run the following:
+
+```shell
+make dev-down
+```
+
 ## Releases
 
 > Note: Releases up until `release-0.12` are changes in their own branches. Changelogs are included in releases starting from [version-0.13.0](https://github.com/kubernetes-monitoring/kubernetes-mixin/releases/tag/version-0.13.0).

@@ -33,7 +64,7 @@ Some alerts now use Prometheus filters made available in Prometheus 2.11.0, which
 Warning: This compatibility matrix was initially created based on experience; we do not guarantee the compatibility, and it may be updated based on new learnings.
 
-Warning: By default the expressions will generate *grafana 7.2+* compatible rules using the *$__rate_interval* variable for rate functions. If you need backward compatible rules please set *grafana72: false* in your *_config*
+Warning: By default the expressions will generate *grafana 7.2+* compatible rules using the *$\_\_rate_interval* variable for rate functions. If you need backward-compatible rules, please set *grafana72: false* in your *\_config*.
 
 ### Release steps

@@ -75,6 +106,7 @@ node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate5m
 This mixin is designed to be vendored into the repo with your infrastructure config. To do this, use [jsonnet-bundler](https://github.com/jsonnet-bundler/jsonnet-bundler):
 
 You then have three options for deploying your dashboards
+
 1. Generate the config files and deploy them yourself
 2. Use ksonnet to deploy this mixin along with Prometheus and Grafana
 3. Use prometheus-operator to deploy this mixin (TODO)

@@ -109,11 +141,12 @@ The `prometheus_alerts.yaml` and `prometheus_rules.yaml` files then need to be passed
 ### Dashboards for Windows Nodes
 
 There exist separate dashboards for Windows resources.
-1) Compute Resources / Cluster(Windows)
-2) Compute Resources / Namespace(Windows)
-3) Compute Resources / Pod(Windows)
-4) USE Method / Cluster(Windows)
-5) USE Method / Node(Windows)
+
+1. Compute Resources / Cluster(Windows)
+2. Compute Resources / Namespace(Windows)
+3. Compute Resources / Pod(Windows)
+4. USE Method / Cluster(Windows)
+5. USE Method / Node(Windows)
 
 These dashboards are based on metrics populated by [windows-exporter](https://github.com/prometheus-community/windows_exporter) from each Windows node.

@@ -270,14 +303,14 @@ The same result can be achieved by modifying the existing `config.libsonnet` with the
 While the community has not yet fully agreed on alert severities and how they are to be used, this repository assumes the following paradigms when setting the severities:
 
-* Critical: An issue, that needs to page a person to take instant action
-* Warning: An issue, that needs to be worked on but in the regular work queue or for during office hours rather than paging the oncall
-* Info: Is meant to support a trouble shooting process by informing about a non-normal situation for one or more systems but not worth a page or ticket on its own.
+- Critical: An issue that needs to page a person to take instant action
+- Warning: An issue that needs to be worked on, but in the regular work queue or during office hours, rather than paging the on-call
+- Info: Meant to support a troubleshooting process by informing about a non-normal situation for one or more systems, but not worth a page or ticket on its own.
 
 ### Architecture and Technical Decisions
 
-* For more motivation, see "[The RED Method: How to instrument your services](https://kccncna17.sched.com/event/CU8K/the-red-method-how-to-instrument-your-services-b-tom-wilkie-kausal?iframe=no&w=100%&sidebar=yes&bg=no)" talk from CloudNativeCon Austin.
-* For more information about monitoring mixins, see this [design doc](DESIGN.md).
+- For more motivation, see the "[The RED Method: How to instrument your services](https://kccncna17.sched.com/event/CU8K/the-red-method-how-to-instrument-your-services-b-tom-wilkie-kausal?iframe=no&w=100%&sidebar=yes&bg=no)" talk from CloudNativeCon Austin.
+- For more information about monitoring mixins, see this [design doc](DESIGN.md).
 
 ## Note
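The `grafana72` toggle in the warning touched by this diff is set in the mixin's `_config`; a minimal sketch in jsonnet, the repo's own language (the import path is an assumption, adjust it to your vendoring layout):

```jsonnet
// Sketch: generate backward-compatible rate rules by disabling
// the Grafana 7.2+ $__rate_interval behaviour via _config.
(import 'kubernetes-mixin/mixin.libsonnet') + {
  _config+:: {
    grafana72: false,
  },
}
```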

scripts/lgtm.sh

Lines changed: 38 additions & 18 deletions

@@ -1,27 +1,47 @@
 #!/bin/bash
-
 set -ex
 
-# export time in milliseconds
-# export OTEL_METRIC_EXPORT_INTERVAL=500
-
-# use http instead of https (needed because of https://github.com/open-telemetry/opentelemetry-go/issues/4834)
-# export OTEL_EXPORTER_OTLP_INSECURE="true"
-
-# https://github.com/grafana/docker-otel-lgtm/tree/main/examples
-
-# docker run -p 3001:3000 -p 4317:4317 -p 4318:4318 \
-#   -v ./provisioning/dashboards:/otel-lgtm/grafana/conf/provisioning/dashboards \
-#   -v ../dashboards_out:/kubernetes-mixin/dashboards_out \
-#   --rm -ti grafana/otel-lgtm
-
 cp ../prometheus_alerts.yaml provisioning/prometheus/
 cp ../prometheus_rules.yaml provisioning/prometheus/
 
-# set up 1-node k3d cluster
-k3d cluster create kubernetes-mixin \
-  -v "$PWD"/provisioning:/kubernetes-mixin/provisioning \
-  -v "$PWD"/../dashboards_out:/kubernetes-mixin/dashboards_out
+# Create kind cluster with kube-scheduler resource metrics enabled
+kind create cluster --name kubernetes-mixin --config - <<EOF
+kind: Cluster
+apiVersion: kind.x-k8s.io/v1alpha4
+nodes:
+- role: control-plane
+  kubeadmConfigPatches:
+  - |
+    kind: ClusterConfiguration
+    scheduler:
+      extraArgs:
+        authorization-always-allow-paths: "/metrics,/metrics/resources"
+        bind-address: "0.0.0.0"
+  extraMounts:
+  - hostPath: "$PWD/provisioning"
+    containerPath: /kubernetes-mixin/provisioning
+  - hostPath: "$PWD/../dashboards_out"
+    containerPath: /kubernetes-mixin/dashboards_out
+EOF
+
+# Wait for cluster to be ready
+kubectl wait --for=condition=Ready nodes --all --timeout=300s
+
+# Create kube-scheduler service for metrics access
+kubectl apply -f - <<EOF
+apiVersion: v1
+kind: Service
+metadata:
+  name: kube-scheduler
+  namespace: kube-system
+spec:
+  selector:
+    component: kube-scheduler
+  ports:
+  - port: 10259
+    targetPort: 10259
+    protocol: TCP
+EOF
 
 # run grafana, prometheus
 kubectl apply -f lgtm.yaml
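The `authorization-always-allow-paths` scheduler flag exempts only the exact paths listed in its comma-separated value from authorization, which is why both `/metrics` and `/metrics/resources` must appear. A small local check (hypothetical helper, not part of the script) illustrates the matching:

```shell
#!/bin/sh
# The value passed to the kube-scheduler in the kind config above.
ALLOW_PATHS="/metrics,/metrics/resources"

# path_allowed: succeed if $1 is one of the comma-separated entries.
path_allowed() {
  case ",$ALLOW_PATHS," in
    *",$1,"*) return 0 ;;
    *) return 1 ;;
  esac
}

path_allowed /metrics && echo "/metrics is exempt from authorization"
path_allowed /metrics/resources && echo "/metrics/resources is exempt"
```

A path not in the list, such as `/healthz`, would still require an authorized client.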

scripts/otel-collector-deployment.values.yaml

Lines changed: 31 additions & 1 deletion

@@ -29,6 +29,7 @@ clusterRole:
       - watch
   - nonResourceURLs:
       - /metrics
+      - /metrics/resources
     verbs:
       - get
   - apiGroups:

@@ -130,7 +131,36 @@ config:
           target_label: instance
         - source_labels: [__meta_kubernetes_namespace]
           target_label: namespace
-
+
+      - job_name: kube-scheduler
+        kubernetes_sd_configs:
+          - role: service
+        relabel_configs:
+          - source_labels: [__meta_kubernetes_service_name]
+            action: keep
+            regex: kube-scheduler
+          - source_labels: [__meta_kubernetes_namespace]
+            action: keep
+            regex: kube-system
+        scheme: https
+        tls_config:
+          insecure_skip_verify: true
+
+      - job_name: kube-scheduler-resources
+        kubernetes_sd_configs:
+          - role: service
+        relabel_configs:
+          - source_labels: [__meta_kubernetes_service_name]
+            action: keep
+            regex: kube-scheduler
+          - source_labels: [__meta_kubernetes_namespace]
+            action: keep
+            regex: kube-system
+        metrics_path: /metrics/resources
+        scheme: https
+        tls_config:
+          insecure_skip_verify: true
+
     processors:
       batch: {}
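The two scrape jobs added above differ only in `job_name` and `metrics_path` (the first job uses Prometheus's default path, `/metrics`); both keep only targets whose discovered service is `kube-scheduler` in `kube-system`. The shared shape can be sketched with a small generator (hypothetical helper, not part of the chart values):

```shell
#!/bin/sh
# gen_job: emit one kube-scheduler scrape job; only job_name and
# metrics_path vary between the two jobs in the values file.
gen_job() {
  name=$1
  path=$2
  cat <<EOF
- job_name: $name
  kubernetes_sd_configs:
    - role: service
  relabel_configs:
    - source_labels: [__meta_kubernetes_service_name]
      action: keep
      regex: kube-scheduler
    - source_labels: [__meta_kubernetes_namespace]
      action: keep
      regex: kube-system
  metrics_path: $path
  scheme: https
  tls_config:
    insecure_skip_verify: true
EOF
}

gen_job kube-scheduler /metrics
gen_job kube-scheduler-resources /metrics/resources
```

`insecure_skip_verify: true` is needed because the scheduler serves its metrics over HTTPS with a self-signed certificate in this local setup.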
