Missing boskos metrics #3778
Comments
Velodrome is deployed but getting 404 from Google Health check |
from boskos-metrics |
Boskos is like 2 years old. We need to upgrade probably - not sure how much work that will be though |
Also, boskos now seems to be 100% broken. I think it's from the 1.22 upgrade |
The only Istio cluster that we upgraded to 1.22 was the Prow service cluster, which doesn't appear to have any boskos deployments. The build cluster is still at 1.21; is that the cluster you are referring to? I'm guessing yes based on this: https://github.com/istio/test-infra/blob/1067422a1559d3c3fb04cc904c2a0ec7753ca0dc/boskos/Makefile#L19-L20
With respect to boskos metrics and alerts, we have some details about setting that up on the new monitoring stack using terraform here: https://github.com/GoogleCloudPlatform/oss-test-infra/tree/master/prow/oss/terraform#boskos-alerts |
Ah ok. Then perhaps it's not a 1.22 issue. However, it is broken for some
reason. Metrics had been broken for months and boskos has been entirely
broken for a few days.
…On Mon, Jan 31, 2022, 5:34 PM Cole Wagner ***@***.***> wrote:
Also boskos now seems to be 100% broken now. I think from 1.22 upgrade
The only istio cluster that we upgraded to 1.22 was the Prow service
cluster which doesn't appear to have any boskos deployments. The build
cluster is still at 1.21, is that the cluster you are referring to? I'm
guessing yes based on this:
https://github.com/istio/test-infra/blob/1067422a1559d3c3fb04cc904c2a0ec7753ca0dc/boskos/Makefile#L19-L20
With respect to boskos metrics and alerts, we have some details about
setting that up on the new monitoring stack using terraform here:
https://github.com/GoogleCloudPlatform/oss-test-infra/tree/master/prow/oss/terraform#boskos-alerts
We do need to configure Workload Metrics in the cluster + a PodMonitor
though.
|
The
The
I'm assuming this is
test-infra/boskos/cluster/boskos-deployment.yaml Lines 17 to 48 in 1067422
|
As far as I can tell there are two problems here:
a) Boskos metrics are not surfaced. For this, as @cjwagner mentioned, it's recommended to migrate over to https://github.com/GoogleCloudPlatform/oss-test-infra/tree/master/prow/oss/terraform#boskos-alerts. @chizhg has experience with setting this up.
b) Boskos stopped working. I couldn't dig enough information out of this issue; my guess is that Boskos stopped recycling resources, which could be due to janitor failures. I inspected the janitor log and saw quite a few failures: https://cloudlogging.app.goo.gl/kf4qgd8Ta7sQDuch7. Let me know if you have trouble accessing it, and I'll create a screenshot for you |
@chaodaiG those logs are from prow-internal, this issue is for the
|
you are right, that was a wrong link. Inspecting |
@howardjohn , @chizhg please take a look |
As a first step, should we update boskos? It's almost 2 years old. Since these are auth errors, they could be related to the WI changes we made recently |
sounds like a good idea to me |
Who owns this? |
The team that owns prow is in the process of providing minimal maintenance of Boskos repo. Users are still responsible for managing their own deployment |
Who are "users" in this context? We have a very fuzzy line from "Istio engineers", "Google Istio engineers", "Google Istio engineers that sometimes work on test related things" to "Google Prow team". I don't know where the lines are nor where this falls. Can we assign to a concrete individual? |
Tracked in istio#3778 to get these back up
* Remove jobs that have been broken for 3+ months Tracked in #3778 to get these back up * testgrid
any update about this? |
Drive-by comment: @howardjohn I recently fixed this for Knative at knative/test-infra#3360. Grab the new PodMonitoring code from the oss-test-infra repo and make sure the boskos deployment has the metrics port defined. I spent a good 20 minutes debugging unavailable metrics in Managed Prometheus. |
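For anyone checking the "metrics port defined" part of the comment above, here is a minimal sketch of that check in Python. It assumes the Deployment manifest has already been loaded into a plain dict (for example from YAML or `kubectl get -o json`), and that the scrape config refers to the port by name. The port name `metrics`, the port numbers, and the trimmed-down `EXAMPLE_DEPLOYMENT` are hypothetical and not taken from the Istio or Knative manifests.

```python
from __future__ import annotations

from typing import Any


def container_port_names(deployment: dict[str, Any]) -> set[str]:
    """Collect the names of all container ports declared in a Deployment manifest."""
    pod_spec = deployment.get("spec", {}).get("template", {}).get("spec", {})
    names: set[str] = set()
    for container in pod_spec.get("containers", []):
        for port in container.get("ports", []):
            name = port.get("name")
            if name:
                names.add(name)
    return names


def has_port(deployment: dict[str, Any], port_name: str) -> bool:
    """Return True if any container in the Deployment declares a port with this name."""
    return port_name in container_port_names(deployment)


# Hypothetical, trimmed-down manifest used only for illustration.
EXAMPLE_DEPLOYMENT: dict[str, Any] = {
    "kind": "Deployment",
    "metadata": {"name": "boskos"},
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "boskos",
                        "ports": [
                            {"name": "http", "containerPort": 8080},
                            # Assumed name; it must match whatever the scrape config references.
                            {"name": "metrics", "containerPort": 9090},
                        ],
                    }
                ]
            }
        }
    },
}

if __name__ == "__main__":
    print(has_port(EXAMPLE_DEPLOYMENT, "metrics"))
```

If the check fails, a scrape config that points at a port by name has nothing to match, which would be consistent with the "unavailable metrics" symptom described above.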
We are not using boskos now |
We used to have boskos metrics at http://velodrome.istio.io/. However, this now gives a 502.
https://monitoring.prow.istio.io/ has a bunch of stuff, but no boskos. It would be nice to see the boskos metrics again