Skip to content

Conversation

@AndreasHolt
Copy link
Contributor

@AndreasHolt AndreasHolt commented Nov 20, 2025

What changed?

  • Updated DeleteShardStats to batch deletions into chunks (size 64) to stay under the etcd transaction limit (128 ops), while preserving the leader guard for safety.
  • Fixed a bug in the runShardStatsCleanupLoop where it was checking len > 0 instead of len == 0, causing it to skip cleanup when stale stats were actually present.

Why?

  • Batching: Shard stats cleanup may involve many keys. Writing them in a single transaction can risk exceeding etcd's 128-op limit.
  • Loop Fix: The previous logic inverted the check, meaning cleanup only ran if there was nothing to clean, effectively disabling the feature.

How did you test it?

  • Added TestDeleteShardStatsDeletesLargeBatches to verify batching behavior.
  • Unit coverage already present for the processor/store and running the canary service.

Potential risks
If a leadership change occurs in the middle of processing multiple batches, the operation will fail. The remaining stale stats will simply be picked up and cleaned on the next periodic tick of the cleanup loop.

Release notes

Documentation Changes

if !resp.Succeeded {
return fmt.Errorf("transaction failed, leadership may have changed")
if _, err := s.client.Delete(ctx, shardStatsKey); err != nil {
return fmt.Errorf("delete shard statistics: %w", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think removing the guard we are going to have several concurrent delets, that are going to result in errors and create a lot of noise in logs and metrics

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ve updated the implementation to keep the guard on these deletions. They are now processed in batches (for now of 64) to satisfy the etcd transaction limit. Every batch is guarded to prevent concurrent deletes and log noise

@eleonoradgr eleonoradgr merged commit 125b038 into cadence-workflow:master Nov 24, 2025
76 of 78 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants