
Missing boskos metrics #3778

Closed
howardjohn opened this issue Jan 19, 2022 · 20 comments
@howardjohn
Member

We used to have boskos metrics at http://velodrome.istio.io/. However, this now gives a 502.

https://monitoring.prow.istio.io/ has a bunch of stuff, but no boskos. It would be nice to see the boskos metrics again

@howardjohn
Member Author

Velodrome is deployed but is getting a 404 from the Google health check.

@howardjohn
Member Author

{"component":"metrics","file":"/go/src/app/cmd/metrics/metrics.go:127","func":"main.handleMetric.func1","handler":"handleMetric","level":"warning","msg":"type must be set in the request.","severity":"warning","time":"2022-01-31T19:38:29Z"}

from boskos-metrics
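The warning suggests the metrics endpoint requires a resource `type` query parameter, which a default load-balancer health probe would not include. A minimal sketch of that implied behavior (a hypothetical reconstruction, not the actual boskos source):

```python
# Hypothetical reconstruction of the check behind the warning log above:
# the handler rejects requests whose query string omits a resource "type",
# so a bare health-check probe against the same path fails.
from urllib.parse import parse_qs, urlparse


def handle_metric(url: str) -> tuple[int, str]:
    """Return (status, body) for a request URL, mimicking the logged check."""
    params = parse_qs(urlparse(url).query)
    rtype = params.get("type", [""])[0]
    if not rtype:
        return 400, "type must be set in the request."
    return 200, f"metrics for resource type {rtype!r}"


# A bare probe (like a default health check) is rejected...
print(handle_metric("/metric")[0])                    # 400
# ...while a request naming a resource type succeeds.
print(handle_metric("/metric?type=gke-project")[0])   # 200
```

If this matches the real handler, the Google health check would need to be pointed at a path that passes (or a dedicated health endpoint) rather than the bare metrics path.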

@howardjohn
Member Author

Boskos is like 2 years old. We probably need to upgrade, though I'm not sure how much work that will be.

@howardjohn
Member Author

Also, boskos now seems to be 100% broken. I think it's from the 1.22 upgrade.

@chizhg
Contributor

chizhg commented Jan 31, 2022

/cc @cjwagner @chaodaiG could you help take a look? Thanks!

@cjwagner
Member

cjwagner commented Feb 1, 2022

Also, boskos now seems to be 100% broken. I think it's from the 1.22 upgrade.

The only istio cluster that we upgraded to 1.22 was the Prow service cluster which doesn't appear to have any boskos deployments. The build cluster is still at 1.21, is that the cluster you are referring to? I'm guessing yes based on this:

deploy: get-cluster-credentials get-api-resources
	kubectl apply -Rf cluster --prune -l app.kubernetes.io/part-of=boskos $(PRUNE_WL)

With respect to boskos metrics and alerts, we have some details about setting that up on the new monitoring stack using terraform here: https://github.com/GoogleCloudPlatform/oss-test-infra/tree/master/prow/oss/terraform#boskos-alerts
We do need to configure Workload Metrics in the cluster plus a PodMonitor, though.

@howardjohn
Member Author

howardjohn commented Feb 1, 2022 via email

@cjwagner
Member

cjwagner commented Feb 1, 2022

The boskos-metrics component has a lot of logs like

{"component":"metrics","file":"/go/src/app/cmd/metrics/metrics.go:127","func":"main.handleMetric.func1","handler":"handleMetric","level":"warning","msg":"type must be set in the request.","severity":"warning","time":"2022-02-01T01:44:12Z"}

The boskos-mason component has logs like:

E0129 08:12:00.814176 1 reflector.go:153] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: Failed to list *crds.DRLCObject: Get https://10.0.0.1:443/apis/boskos.k8s.io/v1/namespaces/boskos/dynamicresourcelifecycles?limit=500&resourceVersion=0: dial tcp 10.0.0.1:443: connect: connection refused

I'm assuming this is boskos-mason attempting to interact with the build cluster itself, which it should have the correct RBAC for:

serviceAccountName: boskos-admin

---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  labels:
    app.kubernetes.io/part-of: boskos
  name: boskos-crd-admin
rules:
- apiGroups:
  - apiextensions.k8s.io
  verbs: ["*"]
  resources:
  - customresourcedefinitions
- apiGroups: ["boskos.k8s.io"]
  verbs: ["*"]
  resources: ["*"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  labels:
    app.kubernetes.io/part-of: boskos
  name: boskos-crd-admin-binding
subjects:
- kind: ServiceAccount
  name: boskos-admin
  namespace: boskos
roleRef:
  kind: ClusterRole
  apiGroup: rbac.authorization.k8s.io
  name: boskos-crd-admin

@chaodaiG
Contributor

chaodaiG commented Feb 1, 2022

As far as I can tell there are two problems here:

a) Boskos metrics are not surfaced. For this issue, as @cjwagner mentioned, it's recommended to migrate over to https://github.com/GoogleCloudPlatform/oss-test-infra/tree/master/prow/oss/terraform#boskos-alerts; @chizhg has experience with setting this up.

b) Boskos stopped working. There isn't enough information in this issue for me to dig further; my guess is that Boskos stopped recycling resources, which could be due to janitor failures. I inspected the janitor log and saw quite a few failures: https://cloudlogging.app.goo.gl/kf4qgd8Ta7sQDuch7. Let me know if you have trouble accessing this link and I'll create a screenshot for you.

@howardjohn
Member Author

@chaodaiG those logs are from prow-internal, this issue is for the prow cluster.

resource.type="k8s_container"
resource.labels.project_id="istio-prow-build"
resource.labels.namespace_name="boskos"
resource.labels.container_name="boskos"
severity=ERROR

@chaodaiG
Contributor

chaodaiG commented Feb 1, 2022

You are right, that was the wrong link. Inspecting the prow cluster, it also has the same error: https://cloudlogging.app.goo.gl/1ryxri5TdbMZscyT7

@chaodaiG
Contributor

chaodaiG commented Feb 3, 2022

You are right, that was the wrong link. Inspecting the prow cluster, it also has the same error: https://cloudlogging.app.goo.gl/1ryxri5TdbMZscyT7

@howardjohn, @chizhg please take a look.

@howardjohn
Member Author

As a first step, should we update boskos? It's almost 2 years old, and the auth errors could be related to the WI changes we made recently.

@chaodaiG
Contributor

chaodaiG commented Feb 3, 2022

As a first step, should we update boskos? It's almost 2 years old, and the auth errors could be related to the WI changes we made recently.

sounds like a good idea to me

@howardjohn
Member Author

Who owns this?

@chaodaiG
Contributor

The team that owns Prow is in the process of providing minimal maintenance of the Boskos repo. Users are still responsible for managing their own deployments.

@howardjohn
Member Author

Who are "users" in this context? We have a very fuzzy line from "Istio engineers", "Google Istio engineers", "Google Istio engineers that sometimes work on test related things" to "Google Prow team". I don't know where the lines are nor where this falls. Can we assign to a concrete individual?

howardjohn added a commit to howardjohn/test-infra that referenced this issue Mar 21, 2022
istio-testing pushed a commit that referenced this issue Mar 21, 2022
* Remove jobs that have been broken for 3+ months

Tracked in #3778 to get these
back up

* testgrid
@zirain
Member

zirain commented Mar 23, 2022

any update about this?

@upodroid
Contributor

upodroid commented Jun 3, 2022

Drive by comment:

@howardjohn I recently fixed this for Knative in knative/test-infra#3360. Grab the new PodMonitoring code from the oss-test-infra repo and make sure the boskos deployment has the metrics port defined. I spent a good 20 minutes debugging unavailable metrics in Managed Prometheus.
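The fix described above can be sketched as a Managed Prometheus PodMonitoring resource plus a named metrics port on the deployment. All names, labels, and the port number below are assumptions for illustration and would need to match the actual boskos manifests:

```yaml
# Hypothetical sketch -- names, labels, and port number are assumptions.
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: boskos
  namespace: boskos
spec:
  selector:
    matchLabels:
      app: boskos
  endpoints:
  - port: metrics        # must match a named containerPort on the pod
    interval: 30s
```

The boskos Deployment's container spec must then expose a port whose name matches the endpoint above, e.g. `ports: [{name: metrics, containerPort: 9090}]` (the port number here is an assumption); without the named port, Managed Prometheus silently scrapes nothing, which matches the debugging experience described above.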

@howardjohn
Member Author

We are not using boskos now
