feat: Add WaitWithTimeout to Partition and WaitGroupTimeout #26294

Draft
wants to merge 11 commits into base: master-1.x

Conversation

@devanbenz devanbenz commented Apr 18, 2025

This PR makes it easier to debug potential hanging retention service routines during DeleteShard.

Currently we are seeing the following traces in goroutine profiles from customers experiencing issues where shards persist past their retention policy.

      1103   runtime.gopark
             runtime.selectgo
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Partition).runPeriodicCompaction
        32   runtime.gopark
             runtime.goparkunlock (inline)
             runtime.semacquire1
             sync.runtime_Semacquire
             sync.(*WaitGroup).Wait
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*LogFile).Close
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Partition).compactLogFile
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Partition).compact.func1
        16   runtime.gopark
             runtime.goparkunlock (inline)
             runtime.semacquire1
             sync.runtime_Semacquire
             sync.(*WaitGroup).Wait
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*IndexFile).Close
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Partition).compactToLevel
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Partition).compact.func2.1
         1   runtime.gopark
             runtime.chanrecv
             runtime.chanrecv1
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Partition).Wait
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Partition).Close
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Index).close
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Index).Close
             github.com/influxdata/influxdb/tsdb.(*Shard).closeNoLock
             github.com/influxdata/influxdb/tsdb.(*Shard).Close
             github.com/influxdata/influxdb/tsdb.(*Store).DeleteShard
             github.com/influxdata/influxdb/services/retention.(*Service).DeletionCheck.func3
             github.com/influxdata/influxdb/services/retention.(*Service).DeletionCheck
             github.com/influxdata/influxdb/services/retention.(*Service).run
             github.com/influxdata/influxdb/services/retention.(*Service).Open.func1

Where Wait is here https://github.com/influxdata/influxdb/pull/26294/files#diff-55346f580e7216556be601bef5602df49cf19af75131749c46096475d68126f9R379

I believe that somehow we are waiting indefinitely for CurrentCompactionN, which never decrements to 0. This causes the retention service to hang when it reaches Partition.Close.

This PR will not resolve the issue, but it will show whether my theory about the root cause is correct.

func WaitGroupTimeout(wg *sync.WaitGroup, timeout time.Duration) bool {
	c := make(chan struct{})

	go func() {
Contributor
take unidirectional channel as parameter
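For illustration, the unidirectional form being suggested could look something like this sketch (done is a hypothetical rename of c, not part of the PR):

```go
// Sketch only: pass the channel into the function literal as send-only,
// so the goroutine can signal completion but never receive from it.
go func(done chan<- struct{}) {
	wg.Wait()
	close(done)
}(c)
```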

	case <-c:
		return false
	case <-timer.C:
		return true
Contributor
Log here, but do not exit.

"time"
)

func WaitGroupTimeout(wg *sync.WaitGroup, timeout time.Duration) bool {
Contributor
I would take a zap.Logger and a message here. Print the message at Warn level each time around the loop, with the total elapsed duration as a zap field. No return value - make this like Wait

I would not return on timer ticks; that will change the behavior of the system. Just log the wait.

Style suggestion: take c as an argument to the lambda, restricting its directionality, and renaming it.
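For reference, a minimal sketch of what this suggestion might look like. The helper name waitGroupWarn, the package clause, and reusing timeout as the logging interval are assumptions for illustration, not the PR's actual code:

```go
package example // hypothetical package, for illustration only

import (
	"sync"
	"time"

	"go.uber.org/zap"
)

// waitGroupWarn behaves like wg.Wait() but, every timeout interval, logs msg
// at Warn level with the total elapsed time, so a stuck WaitGroup shows up in
// the logs without changing the behavior of the system.
func waitGroupWarn(wg *sync.WaitGroup, log *zap.Logger, msg string, timeout time.Duration) {
	start := time.Now()
	c := make(chan struct{})

	// Style suggestion applied: the lambda takes the channel as send-only.
	go func(done chan<- struct{}) {
		wg.Wait()
		close(done)
	}(c)

	ticker := time.NewTicker(timeout)
	defer ticker.Stop()
	for {
		select {
		case <-c:
			return
		case <-ticker.C:
			// Log, but do not exit; keep waiting just as Wait would.
			log.Warn(msg, zap.Duration("elapsed", time.Since(start)))
		}
	}
}
```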

Contributor

@davidby-influx left a comment

In line comments

@devanbenz changed the title from "feat: Add WaitWithTimeout to Partition" to "feat: Add WaitWithTimeout to Partition and WaitGroupTimeout" Apr 18, 2025
// If the loop goes for > duration it will return true (timedOut)
// if it does not time out it returns false (!timedOut)
func (p *Partition) WaitWithTimeout(duration time.Duration) bool {
	timeout := time.NewTimer(duration)
Contributor
I would get rid of timeout and just have a ticker. In the tick case, first check for p.CurrentCompactionN() == 0, then check if time.Since(startOfMethod) > duration, where startOfMethod is a time.Now() captured on function entry.
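A rough sketch of that shape, for illustration only; the one-second polling interval is an assumption, and CurrentCompactionN is the existing Partition method referenced above:

```go
// Sketch of the suggested restructuring: poll on a ticker, checking the
// compaction count first and the elapsed time second.
func (p *Partition) WaitWithTimeout(duration time.Duration) bool {
	startOfMethod := time.Now() // captured on function entry

	ticker := time.NewTicker(time.Second) // assumed polling interval
	defer ticker.Stop()
	for {
		<-ticker.C
		// First: have all outstanding compactions finished?
		if p.CurrentCompactionN() == 0 {
			return false // did not time out
		}
		// Second: have we been waiting longer than duration?
		if time.Since(startOfMethod) > duration {
			return true // timed out
		}
	}
}
```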

@devanbenz requested a review from davidby-influx May 23, 2025 17:44