feat: Add WaitWithTimeout to Partition and WaitGroupTimeout #26294
base: master-1.x
Conversation
This PR makes it easier to debug potential hanging retention service routines during DeleteShard.
pkg/wg_timeout/wg_timeout.go
Outdated
```go
func WaitGroupTimeout(wg *sync.WaitGroup, timeout time.Duration) bool {
	c := make(chan struct{})

	go func() {
```
Take a unidirectional channel as the parameter.
pkg/wg_timeout/wg_timeout.go
Outdated
```go
	case <-c:
		return false
	case <-timer.C:
		return true
```
Log here, but do not exit.
pkg/wg_timeout/wg_timeout.go
Outdated
```go
	"time"
)

func WaitGroupTimeout(wg *sync.WaitGroup, timeout time.Duration) bool {
```
I would take a zap.Logger and a message here. Print the message at Warn level each time around the loop, with the total elapsed duration as a zap field. No return value - make this like Wait.

I would not return on timer ticks; that will change the behavior of the system. Just log the wait.
Style suggestion: take c as an argument to the lambda, restricting its directionality, and renaming it.
Responses in line comments.
tsdb/index/tsi1/partition.go
Outdated
```go
// If the loop goes for > duration it will return true (timedOut)
// if it does not time out it returns false (!timedOut)
func (p *Partition) WaitWithTimeout(duration time.Duration) bool {
	timeout := time.NewTimer(duration)
```
I would get rid of timeout and just have a ticker. In the tick case, first check for p.CurrentCompactionN() == 0, then check if time.Since(startOfMethod) > duration, where startOfMethod is a time.Now() on function entry.
Currently we are seeing the following traces within goroutine profiles for customers experiencing issues where shards persist past their retention policy.

Where Wait is here: https://github.com/influxdata/influxdb/pull/26294/files#diff-55346f580e7216556be601bef5602df49cf19af75131749c46096475d68126f9R379

I believe that somehow we are waiting indefinitely for CurrentCompactionN, which never decrements to 0. This is causing the retention policy to hang when it gets to Partition.Close.

This PR will not resolve the issue, but it will show whether my theory about the root cause is correct.