@Theis-Mathiassen Theis-Mathiassen commented Nov 11, 2025

What changed?
We added functionality to record the load as a moving average for each shard, where the weight of a new data point depends on how recently the average was last updated.

Why?
This smooths the load input for the shard distributor, which is desirable because the load can change sporadically.
It is also necessary to save the load of each shard in etcd, both to persist it (in case the handler crashes) and to make it available to every shard distributor instance.

How did you test it?
We added unit tests and also ran the change with the canary service. The unit tests are:
TestRecordHeartbeatUpdatesShardStatistics:
This test verifies that when an executor sends a heartbeat with ShardLoad information for a shard, the ShardStatistics for that shard are correctly updated in the store, specifically the SmoothedLoad and LastUpdateTime. It also ensures that LastMoveTime remains unchanged if not explicitly updated.

TestRecordHeartbeatSkipsShardStatisticsWithNilReport:
This test ensures that if an executor's heartbeat includes a nil ShardStatusReport for a particular shard, the existing ShardStatistics for that shard are not updated or created. It also verifies that valid shard reports are processed correctly.

Potential risks
None, since it is for the shard distributor, which is not utilized in production yet.

Release notes
Not needed, since it is for the shard distributor, which is not utilized in production yet.

Documentation Changes
No, but some documentation should probably be written later.

AndreasHolt and others added 27 commits October 20, 2025 14:05
… is being reassigned in AssignShard

Signed-off-by: Andreas Holt <[email protected]>
…to not overload etcd's 128 max ops per txn

Signed-off-by: Andreas Holt <[email protected]>
…s txn and retry monotonically

Signed-off-by: Andreas Holt <[email protected]>
…shard metrics, move out to staging to separate function

Signed-off-by: Andreas Holt <[email protected]>
… And more idiomatic naming of collection vs singular type

Signed-off-by: Andreas Holt <[email protected]>
…ook more like executor key tests

Signed-off-by: Andreas Holt <[email protected]>
…ey in BuildShardKey, as we don't use it

Signed-off-by: Andreas Holt <[email protected]>
AndreasHolt and others added 17 commits November 11, 2025 15:57
…ents

Signed-off-by: Andreas Holt <[email protected]>
Signed-off-by: Theis Randeris Mathiassen <[email protected]>
…shard metrics, move out to staging to separate function

Signed-off-by: Andreas Holt <[email protected]>
Signed-off-by: Theis Randeris Mathiassen <[email protected]>
Signed-off-by: Andreas Holt <[email protected]>
Signed-off-by: Theis Randeris Mathiassen <[email protected]>
… And more idiomatic naming of collection vs singular type

Signed-off-by: Andreas Holt <[email protected]>
Signed-off-by: Theis Randeris Mathiassen <[email protected]>
…o "statistics"

Signed-off-by: Andreas Holt <[email protected]>
Signed-off-by: Theis Randeris Mathiassen <[email protected]>
…ollow conventions

Signed-off-by: Andreas Holt <[email protected]>
Signed-off-by: Theis Randeris Mathiassen <[email protected]>
Signed-off-by: Andreas Holt <[email protected]>
Signed-off-by: Theis Randeris Mathiassen <[email protected]>
…eartbeat TTL

Signed-off-by: Andreas Holt <[email protected]>
Signed-off-by: Theis Randeris Mathiassen <[email protected]>
…o ewma)

Signed-off-by: Andreas Holt <[email protected]>
Signed-off-by: Theis Randeris Mathiassen <[email protected]>
…t heartbeat

Signed-off-by: Andreas Holt <[email protected]>
Signed-off-by: Theis Randeris Mathiassen <[email protected]>
…rdStatistics

Signed-off-by: Andreas Holt <[email protected]>
Signed-off-by: Theis Randeris Mathiassen <[email protected]>
Signed-off-by: Andreas Holt <[email protected]>
Signed-off-by: Theis Randeris Mathiassen <[email protected]>
Signed-off-by: Andreas Holt <[email protected]>
Signed-off-by: Theis Randeris Mathiassen <[email protected]>
Signed-off-by: Theis Randeris Mathiassen <[email protected]>
@Theis-Mathiassen Theis-Mathiassen force-pushed the heartbeat-shard-statistics branch from 293a154 to 8c6b0c8 Compare November 11, 2025 14:58
@Theis-Mathiassen Theis-Mathiassen marked this pull request as ready for review November 11, 2025 15:00
continue
}

_, err = s.client.Put(ctx, shardStatsKey, string(payload))
Contributor

I am worried that we are generating too much write load on etcd. Do you think we can avoid writing if the load is stable?

Author

@Theis-Mathiassen Theis-Mathiassen Nov 12, 2025

Yes, I think that is possible. For example, if the load from the heartbeat is within +/- 5% of the currently stored value, we do not update it?

Some problems with this approach might be:

Because of how ewmaSmoothedLoad calculates the new load, the longer it has been since the last update, the greater the impact of the new load.

So I fear that for a shard with a very stable load and occasional spikes, the smoothing effect is lost. Excuse my drawing, but I hope it illustrates the idea:
[image: the author's drawing]

It might still be possible; we will probably just have to rethink the update function to avoid this.
We will look into how to solve this, and comment again when we have something.
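The concern above can be reproduced with a small simulation. This is only a sketch under assumptions: one sample per second, an exponential weight `1 - exp(-dt/tau)` standing in for whatever ewmaSmoothedLoad actually does, and hypothetical names `simulate` and `stableThenSpike`.

```go
package main

import "math"

// simulate feeds one sample per second into a time-weighted average and
// returns the smoothed value after the last sample. If skipBand > 0, samples
// within +/- skipBand (relative) of the stored value are not applied, so the
// time since the last update keeps growing while the load is stable.
func simulate(samples []float64, tauSeconds, skipBand float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	smoothed := samples[0]
	lastUpdate := 0 // second at which the stored value was last updated
	for t := 1; t < len(samples); t++ {
		s := samples[t]
		if skipBand > 0 && math.Abs(s-smoothed) <= skipBand*math.Abs(smoothed) {
			continue // load is "stable": skip the update
		}
		dt := float64(t - lastUpdate)
		alpha := 1 - math.Exp(-dt/tauSeconds)
		smoothed = alpha*s + (1-alpha)*smoothed
		lastUpdate = t
	}
	return smoothed
}

// stableThenSpike returns n samples of base followed by a single spike.
func stableThenSpike(n int, base, spike float64) []float64 {
	samples := make([]float64, 0, n+1)
	for i := 0; i < n; i++ {
		samples = append(samples, base)
	}
	return append(samples, spike)
}
```

With 60 stable samples of 10 followed by a spike to 100 and `tauSeconds` of 30, updating on every heartbeat leaves the smoothed load at about 13, while skipping the stable updates lets the spike pull it to about 88, because 60 seconds have passed since the last applied update.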

Contributor

@AndreasHolt AndreasHolt Nov 13, 2025

I have implemented something that helps throttle the writes in 158e030.
We now only write:

  1. If load has fluctuated enough
  2. If enough time has passed since last write (we don't want to cause a stats cleanup just because load hasn't fluctuated)

Edit:
e0779ec decouples the stats cleanup from the heartbeat TTL, which was previously used to determine whether a shard stat is stale.

@AndreasHolt AndreasHolt force-pushed the heartbeat-shard-statistics branch from 5f564d7 to dd27ab1 Compare November 18, 2025 16:51
@AndreasHolt AndreasHolt force-pushed the heartbeat-shard-statistics branch from dd27ab1 to e0779ec Compare November 18, 2025 16:57
@Theis-Mathiassen Theis-Mathiassen force-pushed the heartbeat-shard-statistics branch from 13c5850 to 481f9c6 Compare November 20, 2025 08:55