Add store performance grafana alerts #5255

Open
jagathweerasinghe-da wants to merge 13 commits into main from wee/add_store_performance_grafana_alerts

Conversation

@jagathweerasinghe-da
Contributor

No description provided.

conditions:
  - evaluator:
      type: gt
      params: [300]
Contributor

My main worry is, with an example:

  • Ingestion test now takes ~200s
  • A day later somebody does a change that makes ingestion 25% slower (that's very significant)
  • Now the test takes 250s, which would still be < 300, we wouldn't alert, and likely nobody would notice

This is fine as an approach (per the design doc, "the CI run fails if the performance is below expected throughputs"), but then:

  • These thresholds need to be tight enough that we'd detect big swings, but loose enough that they don't flake too often. They may already be; I cannot tell from this PR alone
  • We need to document our expectations to contributors wrt these tests (DEVELOPMENT.md/CONTRIBUTING.md, maybe?). From the design doc: "We expect that most changes won’t affect ingestion performance and for the ones that do the PR author runs the performance test. We can still switch to a more restrictive approach if we see that it is necessary."
  • Likewise, we should document for ourselves what to do if these alerts get triggered (debug, check recent commits, bump the threshold here, ...)
  • We should document (a comment here is fine) how/when these thresholds were set, for reference. E.g. "performance tests take ~280s as of yyyy-mm-dd"; see the sketch after this list
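
For reference, a minimal sketch of what such an inline note could look like on the rule quoted above; the wording, the ~280s baseline, and the date placeholder are illustrative only, not taken from this PR:

conditions:
  - evaluator:
      type: gt
      # Example note: ScanStoreIngestionPerformanceTest took ~280s as of yyyy-mm-dd
      # (fill in when setting the value); threshold set somewhat above the observed
      # baseline so large regressions alert without flaking on run-to-run variance.
      params: [300]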

Contributor Author

Let's add the documentation to PERFORMANCE.md as suggested by the design doc.

Contributor Author

Working on determining the threshold. We need to take into account the data size/batch size, as well as how the dataset changes over time.
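
To make that concrete, a minimal sketch of a size-normalized alert expression, assuming a hypothetical splice_perf_ingestion_record_count metric were also exported per test (that metric name is illustrative, not something this PR adds):

model:
  # Hypothetical: alert on seconds per ingested record instead of absolute
  # wall-clock time, so the threshold does not drift as the dataset grows.
  expr: >
    (splice_perf_ingestion_total_time_ns{test="ScanStoreIngestionPerformanceTest"} / 1e9)
    /
    splice_perf_ingestion_record_count{test="ScanStoreIngestionPerformanceTest"}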

Contributor

btw, do you know where the alerts from the splice cluster end up?

Contributor Author

@jagathweerasinghe-da commented on Apr 29, 2026

It ends up in team-canton-network-internal-ci:

./cluster/deployment/splice/.envrc.vars:
export SLACK_ALERT_NOTIFICATION_CHANNEL_FULL_NAME="team-canton-network-internal-ci"

Contributor

pretty sure that channel is mostly ignored in favor of the test-failures dashboard; you might want to ask for opinions in #internal

Contributor Author

@OriolMunoz-da

As per the internal discussion we had, let's go with the GH issue. Thanks for raising this point:

There are two routes to create a GH issue:

  1. GH Action -> Prometheus -> Grafana Alerts -> GH Issue
  2. GH Action -> GH Issue

I would like to go ahead with the second approach, which keeps the integration clean and simple. WDYT?
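
A minimal sketch of route 2, assuming the performance job exposes the measured time as a step output; the step id, output name, and 300s threshold here are illustrative, not part of this PR:

# Hypothetical GitHub Actions step: open an issue when the measured ingestion
# time exceeds the documented threshold.
- name: Open issue on ingestion performance regression
  if: ${{ steps.perf.outputs.ingestion_seconds > 300 }}
  env:
    GH_TOKEN: ${{ github.token }}
  run: |
    gh issue create \
      --title "Store ingestion performance regression" \
      --body "ScanStoreIngestionPerformanceTest took ${{ steps.perf.outputs.ingestion_seconds }}s (threshold 300s) at ${{ github.sha }}."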

Contributor

yeah, 2 seems easiest

to: 0
datasourceUid: prometheus
model:
  expr: splice_perf_ingestion_total_time_ns{test="ScanStoreIngestionPerformanceTest"} / 1e9
Contributor

is there no easy way to have a single alert instead of a separate one per test? I'd rather not have to add an alert every time we add a test, which is easy to forget

I can imagine that the different threshold per test makes it difficult, but maybe there's a way to do that

Contributor Author

Simplified the rules so that the changes needed when adding a new test are minimal.
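
For reference, a minimal sketch of one way to keep this to a single rule, assuming a per-test threshold series (hypothetically named splice_perf_ingestion_threshold_seconds here) is published alongside the timing metric, so adding a test only means publishing its threshold:

model:
  # Hypothetical single alert expression: fires for any test whose measured
  # time exceeds its own per-test threshold series.
  expr: >
    (splice_perf_ingestion_total_time_ns / 1e9)
      > on (test)
    splice_perf_ingestion_threshold_seconds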

@jagathweerasinghe-da force-pushed the wee/add_store_performance_grafana_alerts branch 2 times, most recently from 3395bfa to 2d0d565 on April 29, 2026 at 13:30
@jagathweerasinghe-da force-pushed the wee/add_store_performance_grafana_alerts branch from 85264ee to e96334e on April 30, 2026 at 14:34