Add store performance grafana alerts #5255
jagathweerasinghe-da wants to merge 13 commits into main from
Conversation
```yaml
conditions:
  - evaluator:
      type: gt
      params: [300]
```
My main worry is, with an example:
- Ingestion test now takes ~200s
- A day later somebody does a change that makes ingestion 25% slower (that's very significant)
- Now the test takes 250s, which would still be < 300, we wouldn't alert, and likely nobody would notice
This is fine as an approach (per the design doc, "the CI run fails if the performance is below expected throughputs"), but then:
- These thresholds need to be tight enough that we'd detect big swings, but not flake (too often, I guess). They might be already, I cannot tell from this PR only
- We need to document our expectations to contributors wrt these tests (DEVELOPMENT.md/CONTRIBUTING.md, maybe?). From the design doc: "We expect that most changes won’t affect ingestion performance and for the ones that do the PR author runs the performance test. We can still switch to a more restrictive approach if we see that it is necessary."
- Likewise we should document for us what to do if these alerts get triggered (debug/check commits/bump the threshold here...)
- We should document (a comment here is fine) how/when these thresholds were set, for reference. E.g. "performance tests take ~280s as of yyyy-mm-dd"
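To make the last point concrete, the threshold provenance could be recorded right next to the evaluator in the Grafana provisioning YAML. This is a sketch only; the baseline figure and date below are illustrative placeholders, not measured values:

```yaml
# Baseline: ingestion test took ~X s as of yyyy-mm-dd (fill in when measured).
# Threshold = baseline + headroom; revisit if data size/batch size changes.
conditions:
  - evaluator:
      type: gt
      params: [300]  # seconds
```

Keeping the comment adjacent to the number makes "bump the threshold" decisions auditable later.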
Let's add the documentation to PERFORMANCE.md as suggested by the design doc.
Working on determining the threshold. We need to take into account the data size/batch size, as well as how the dataset changes over time.
btw, do you know where the alerts from the splice cluster end up?
It ends up in team-canton-network-internal-ci:

```
# ./cluster/deployment/splice/.envrc.vars
export SLACK_ALERT_NOTIFICATION_CHANNEL_FULL_NAME="team-canton-network-internal-ci"
```
pretty sure that channel is mostly ignored in favor of the test-failures dashboard; you might want to ask for opinions on #internal
As per the internal discussion we had, let's go with the GH issue. Thanks for bringing up this point.
There are two routes to create a GH issue:
- GH Action -> Prometheus -> Grafana Alerts -> GH Issue
- GH Action -> GH Issue
I would like to go ahead with the second approach, which makes the integration clean and simple. WDYT?
yeah, 2 seems easiest
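The second route (GH Action -> GH Issue) could look roughly like the workflow step below. This is a hypothetical sketch, not the PR's actual workflow: the step name, label, and the assumption that the performance step fails the job on regression are all illustrative; `gh issue create` with `--title`/`--body`/`--label` is the standard GitHub CLI invocation:

```yaml
# Hypothetical step in the performance workflow: file an issue directly
# when an earlier step has failed (assumed to mean a perf regression).
- name: Open issue on performance regression
  if: ${{ failure() }}
  env:
    GH_TOKEN: ${{ github.token }}
  run: |
    gh issue create \
      --title "Performance regression detected in ${{ github.workflow }}" \
      --body "See run: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}" \
      --label performance
```

This keeps the whole loop inside GitHub, with no Prometheus/Grafana hop in between.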
```yaml
to: 0
datasourceUid: prometheus
model:
  expr: splice_perf_ingestion_total_time_ns{test="ScanStoreIngestionPerformanceTest"} / 1e9
```
is there no easy way to have a single alert instead of a separate one per test? I'd like to not have to add an alert when we add a test, which is easy to forget
I can imagine that having a different threshold per test makes this difficult, but maybe there's a way to do it
Simplified the rules, so the changes needed when adding a new test are minimal.
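One way such a single rule can be written is with PromQL vector matching on the `test` label, assuming a per-test threshold series is exported alongside the timing metric. The `splice_perf_threshold_seconds` name is an illustrative assumption, not something from this PR:

```yaml
model:
  # Fires for any test whose runtime exceeds its own threshold, so adding
  # a test means exporting one more threshold series, not a new alert rule.
  # (splice_perf_threshold_seconds is a hypothetical per-test metric.)
  expr: >
    splice_perf_ingestion_total_time_ns / 1e9
      > on(test)
    splice_perf_threshold_seconds
```

The comparison drops all series on the left that stay under their matching threshold, which is exactly the filtering an alert condition needs.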
Force-pushed from 3395bfa to 2d0d565
[static] Signed-off-by: Jagath Weerasinghe <[email protected]>
Force-pushed from 85264ee to e96334e
No description provided.