feat: Add WaitWithTimeout to Partition and WaitGroupTimeout #26294

Draft
wants to merge 11 commits into base: master-1.x

Conversation

@devanbenz devanbenz commented Apr 18, 2025

This PR makes it easier to debug potential hanging retention service routines during DeleteShard.

Currently we are seeing the following traces in goroutine profiles from customers experiencing issues where shards persist past their retention policy.

      1103   runtime.gopark
             runtime.selectgo
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Partition).runPeriodicCompaction
        32   runtime.gopark
             runtime.goparkunlock (inline)
             runtime.semacquire1
             sync.runtime_Semacquire
             sync.(*WaitGroup).Wait
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*LogFile).Close
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Partition).compactLogFile
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Partition).compact.func1
        16   runtime.gopark
             runtime.goparkunlock (inline)
             runtime.semacquire1
             sync.runtime_Semacquire
             sync.(*WaitGroup).Wait
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*IndexFile).Close
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Partition).compactToLevel
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Partition).compact.func2.1
         1   runtime.gopark
             runtime.chanrecv
             runtime.chanrecv1
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Partition).Wait
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Partition).Close
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Index).close
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Index).Close
             github.com/influxdata/influxdb/tsdb.(*Shard).closeNoLock
             github.com/influxdata/influxdb/tsdb.(*Shard).Close
             github.com/influxdata/influxdb/tsdb.(*Store).DeleteShard
             github.com/influxdata/influxdb/services/retention.(*Service).DeletionCheck.func3
             github.com/influxdata/influxdb/services/retention.(*Service).DeletionCheck
             github.com/influxdata/influxdb/services/retention.(*Service).run
             github.com/influxdata/influxdb/services/retention.(*Service).Open.func1

Where Wait is here https://github.com/influxdata/influxdb/pull/26294/files#diff-55346f580e7216556be601bef5602df49cf19af75131749c46096475d68126f9R379

I believe that somehow we are waiting indefinitely for CurrentCompactionN, which never decrements to 0. This causes the retention service to hang when it reaches Partition.Close.

This PR will not resolve the issue, but it will show whether my theory about the root cause is correct.

func WaitGroupTimeout(wg *sync.WaitGroup, timeout time.Duration) bool {
	c := make(chan struct{})

	go func() {
Contributor
take unidirectional channel as parameter
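For illustration, the unidirectional form being suggested could look something like this sketch (done is a hypothetical rename of c, not part of the PR):

```go
// Sketch only: pass the channel into the function literal as send-only,
// so the goroutine can signal completion but never receive from it.
go func(done chan<- struct{}) {
	wg.Wait()
	close(done)
}(c)
```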

	case <-c:
		return false
	case <-timer.C:
		return true
Contributor
Log here, but do not exit.

"time"
)

func WaitGroupTimeout(wg *sync.WaitGroup, timeout time.Duration) bool {
Contributor
I would take a zap.Logger and a message here. Print the message at Warn level each time around the loop, with the total elapsed duration as a zap field. No return value - make this like Wait

I would not return on timer ticks; that will change the behavior of the system. Just log the wait.

Style suggestion: take c as an argument to the lambda, restricting its directionality, and renaming it.
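For reference, a minimal sketch of what this suggestion might look like. The helper name waitGroupWarn, the package clause, and reusing timeout as the logging interval are assumptions for illustration, not the PR's actual code:

```go
package example // hypothetical package, for illustration only

import (
	"sync"
	"time"

	"go.uber.org/zap"
)

// waitGroupWarn behaves like wg.Wait() but, every timeout interval, logs msg
// at Warn level with the total elapsed time, so a stuck WaitGroup shows up in
// the logs without changing the behavior of the system.
func waitGroupWarn(wg *sync.WaitGroup, log *zap.Logger, msg string, timeout time.Duration) {
	start := time.Now()
	c := make(chan struct{})

	// Style suggestion applied: the lambda takes the channel as send-only.
	go func(done chan<- struct{}) {
		wg.Wait()
		close(done)
	}(c)

	ticker := time.NewTicker(timeout)
	defer ticker.Stop()
	for {
		select {
		case <-c:
			return
		case <-ticker.C:
			// Log, but do not exit; keep waiting just as Wait would.
			log.Warn(msg, zap.Duration("elapsed", time.Since(start)))
		}
	}
}
```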

Contributor

@davidby-influx left a comment

In line comments

@devanbenz changed the title from "feat: Add WaitWithTimeout to Partition" to "feat: Add WaitWithTimeout to Partition and WaitGroupTimeout" Apr 18, 2025
// If the loop goes for > duration it will return true (timedOut)
// if it does not time out it returns false (!timedOut)
func (p *Partition) WaitWithTimeout(duration time.Duration) bool {
	timeout := time.NewTimer(duration)
Contributor
I would get rid of timeout and just have a ticker. In the tick case, first check for p.CurrentCompactionN() == 0, then check if time.Since(startOfMethod) > duration, where startOfMethod is a time.Now() captured on function entry.
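A rough sketch of that shape, for illustration only; the one-second polling interval is an assumption, and CurrentCompactionN is the existing Partition method referenced above:

```go
// Sketch of the suggested restructuring: poll on a ticker, checking the
// compaction count first and the elapsed time second.
func (p *Partition) WaitWithTimeout(duration time.Duration) bool {
	startOfMethod := time.Now() // captured on function entry

	ticker := time.NewTicker(time.Second) // assumed polling interval
	defer ticker.Stop()
	for {
		<-ticker.C
		// First: have all outstanding compactions finished?
		if p.CurrentCompactionN() == 0 {
			return false // did not time out
		}
		// Second: have we been waiting longer than duration?
		if time.Since(startOfMethod) > duration {
			return true // timed out
		}
	}
}
```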

@devanbenz requested a review from davidby-influx May 23, 2025 17:44