feat: Heartbeat shard statistics #7431
base: master
Signed-off-by: Andreas Holt <[email protected]>
… is being reassigned in AssignShard Signed-off-by: Andreas Holt <[email protected]>
…to not overload etcd's 128 max ops per txn Signed-off-by: Andreas Holt <[email protected]>
…s txn and retry monotonically Signed-off-by: Andreas Holt <[email protected]>
…ents Signed-off-by: Andreas Holt <[email protected]>
…shard metrics, move out to staging to separate function Signed-off-by: Andreas Holt <[email protected]>
… And more idiomatic naming of collection vs singular type Signed-off-by: Andreas Holt <[email protected]>
…ook more like executor key tests Signed-off-by: Andreas Holt <[email protected]>
…ey in BuildShardKey, as we don't use it Signed-off-by: Andreas Holt <[email protected]>
…o "statistics" Signed-off-by: Andreas Holt <[email protected]>
…ollow conventions Signed-off-by: Andreas Holt <[email protected]>
…eartbeat TTL Signed-off-by: Andreas Holt <[email protected]>
…o ewma) Signed-off-by: Andreas Holt <[email protected]>
…t heartbeat Signed-off-by: Andreas Holt <[email protected]>
…rdStatistics Signed-off-by: Andreas Holt <[email protected]>
Signed-off-by: Theis Randeris Mathiassen <[email protected]>
…ents Signed-off-by: Andreas Holt <[email protected]> Signed-off-by: Theis Randeris Mathiassen <[email protected]>
…shard metrics, move out to staging to separate function Signed-off-by: Andreas Holt <[email protected]> Signed-off-by: Theis Randeris Mathiassen <[email protected]>
Signed-off-by: Andreas Holt <[email protected]> Signed-off-by: Theis Randeris Mathiassen <[email protected]>
… And more idiomatic naming of collection vs singular type Signed-off-by: Andreas Holt <[email protected]> Signed-off-by: Theis Randeris Mathiassen <[email protected]>
…o "statistics" Signed-off-by: Andreas Holt <[email protected]> Signed-off-by: Theis Randeris Mathiassen <[email protected]>
…ollow conventions Signed-off-by: Andreas Holt <[email protected]> Signed-off-by: Theis Randeris Mathiassen <[email protected]>
…eartbeat TTL Signed-off-by: Andreas Holt <[email protected]> Signed-off-by: Theis Randeris Mathiassen <[email protected]>
…o ewma) Signed-off-by: Andreas Holt <[email protected]> Signed-off-by: Theis Randeris Mathiassen <[email protected]>
…t heartbeat Signed-off-by: Andreas Holt <[email protected]> Signed-off-by: Theis Randeris Mathiassen <[email protected]>
…rdStatistics Signed-off-by: Andreas Holt <[email protected]> Signed-off-by: Theis Randeris Mathiassen <[email protected]>
293a154 to 8c6b0c8
    continue
    }

    _, err = s.client.Put(ctx, shardStatsKey, string(payload))
I am worried that we are generating too much write load on etcd. Do you think we can avoid writing if the load is stable?
Yeah, I think that is possible. As an example: if the load from a heartbeat is within ±5% of the currently stored value, we do not update it?
Some problems with this approach might be:
Because of how ewmaSmoothedLoad calculates the new load, the longer it has been since the last load update, the greater the impact of the new load.
So I fear that if we have a very stable shard load with occasional spikes, the smoothing effect is lost. Excuse my drawing, but I hope it illustrates the idea:

It might still be possible; we will probably just have to rethink the update function to avoid this.
We will look into how to solve this, and comment again when we have something.
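The concern above can be made concrete with a small sketch. It assumes a time-weighted EWMA of the form alpha = 1 - exp(-elapsed/tau); the function `smooth` and the constant `tau` are hypothetical stand-ins and not necessarily how ewmaSmoothedLoad is implemented.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// smooth is a hypothetical stand-in for ewmaSmoothedLoad. The weight of the
// new sample grows with the time elapsed since the previous update:
// alpha = 1 - exp(-elapsed/tau). This formula is an assumption.
func smooth(prev, sample float64, elapsed, tau time.Duration) float64 {
	if elapsed <= 0 {
		return prev // no time has passed, keep the previous value
	}
	alpha := 1 - math.Exp(-float64(elapsed)/float64(tau))
	return alpha*sample + (1-alpha)*prev
}

func main() {
	const tau = 60 * time.Second // assumed smoothing time constant

	// Stored value refreshed on a recent heartbeat: the spike is damped.
	frequent := smooth(10, 100, 10*time.Second, tau)

	// Stored value not written for 5 minutes because the load was stable
	// and writes were skipped: the spike almost replaces the average.
	skipped := smooth(10, 100, 5*time.Minute, tau)

	fmt.Printf("last update 10s ago: %.1f\n", frequent)
	fmt.Printf("last update 5m ago:  %.1f\n", skipped)
}
```

With these assumed numbers, a spike from 10 to 100 moves the average to roughly 23.8 when the last update was 10 seconds ago, but to roughly 99.4 when the last update was 5 minutes ago, which is the loss of smoothing described above.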
We have implemented something that can help throttle the writes in 158e030.
We now only write:
- If the load has fluctuated enough
- If enough time has passed since the last write (we don't want to cause a stats cleanup just because the load hasn't fluctuated)
Edit:
e0779ec decouples the stats cleanup, which before this change was coupled to heartbeat TTL when determining whether a shard stat is stale.
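A minimal sketch of the two write conditions listed above. The function name `shouldPersist` and both thresholds are hypothetical; the actual values and comparison used in 158e030 are not stated in this thread.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// Hypothetical thresholds, chosen only for illustration.
const (
	minRelativeChange = 0.05             // write if the load moved by more than 5%
	maxWriteInterval  = 30 * time.Second // write at least this often
)

// shouldPersist reports whether a new load value should be written to the
// store, given the stored value and the time since the last write.
func shouldPersist(stored, current float64, sinceLastWrite time.Duration) bool {
	// Refresh periodically so a stable shard is not mistaken for a stale one.
	if sinceLastWrite >= maxWriteInterval {
		return true
	}
	// Avoid dividing by zero when nothing meaningful is stored yet.
	if stored == 0 {
		return current != 0
	}
	return math.Abs(current-stored)/math.Abs(stored) > minRelativeChange
}

func main() {
	fmt.Println(shouldPersist(100, 102, 5*time.Second)) // small change, recent write
	fmt.Println(shouldPersist(100, 120, 5*time.Second)) // large change
	fmt.Println(shouldPersist(100, 100, time.Minute))   // stable, but last write is old
}
```

The time-based condition is checked first so that a stable shard still gets refreshed before any cleanup would consider its statistics stale.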
…adencefork into heartbeat-shard-statistics Signed-off-by: Andreas Holt <[email protected]>
5f564d7 to dd27ab1
…on from heartbeat TTL Signed-off-by: Andreas Holt <[email protected]>
dd27ab1 to e0779ec
…nstead of new ewma for determinig if it should persist Signed-off-by: Theis Mathiassen <[email protected]>
13c5850 to 481f9c6
What changed?
We added functionality to record the load as a moving average for each shard, where the weight of a new data point depends on how recently the average was last updated.
Why?
This is done to smooth the load input for the shard distributor, which is desirable because the load can change sporadically.
It is also necessary to save the load of each shard in etcd, both to persist it (in case the handler crashes) and to make it available to every shard distributor instance.
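As a rough illustration of what persisting a shard's statistics could look like, the sketch below serializes a record to a string payload of the kind passed to `s.client.Put` and reads it back. The field names follow the test descriptions in this PR; the field types, JSON tags, and the helpers `encodeStats` and `decodeStats` are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ShardStatistics uses the field names mentioned in this PR.
// The types and JSON tags are assumptions made for this sketch.
type ShardStatistics struct {
	SmoothedLoad   float64 `json:"smoothed_load"`
	LastUpdateTime int64   `json:"last_update_time"` // Unix seconds (assumed)
	LastMoveTime   int64   `json:"last_move_time"`   // Unix seconds (assumed)
}

// encodeStats turns the statistics into a string payload for a key-value store.
func encodeStats(s ShardStatistics) (string, error) {
	b, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// decodeStats restores the statistics from a stored payload, which is what
// another distributor instance or a restarted handler would do.
func decodeStats(payload string) (ShardStatistics, error) {
	var s ShardStatistics
	err := json.Unmarshal([]byte(payload), &s)
	return s, err
}

func main() {
	payload, err := encodeStats(ShardStatistics{SmoothedLoad: 12.5, LastUpdateTime: 1700000000})
	if err != nil {
		panic(err)
	}
	fmt.Println(payload)

	restored, err := decodeStats(payload)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", restored)
}
```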
How did you test it?
We created unit tests and also ran the change with the canary service:
TestRecordHeartbeatUpdatesShardStatistics:
This test verifies that when an executor sends a heartbeat with ShardLoad information for a shard, the ShardStatistics for that shard are correctly updated in the store, specifically the SmoothedLoad and LastUpdateTime. It also ensures that LastMoveTime remains unchanged if not explicitly updated.
TestRecordHeartbeatSkipsShardStatisticsWithNilReport:
This test ensures that if an executor's heartbeat includes a nil ShardStatusReport for a particular shard, the existing ShardStatistics for that shard are not updated or created. It also verifies that valid shard reports are processed correctly.
Potential risks
None, since it is for the shard distributor, which is not utilized in production yet.
Release notes
Not needed, since it is for the shard distributor, which is not utilized in production yet.
Documentation Changes
No, but some documentation should probably be created later.