fix: graceful shutdown of bg goroutines before DB close#158
fix: graceful shutdown of bg goroutines before DB close#158Dunsin-cyber wants to merge 2 commits into
Conversation
|
Warning Rate limit exceeded
To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. WalkthroughThe PR introduces a Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Review rate limit: 0/1 reviews remaining, refill in 34 minutes and 39 seconds.Comment |
There was a problem hiding this comment.
Arkana Code Review — #158
The intent is correct: wait for background goroutines to drain before closing the DB. However, the current implementation has a critical deadlock bug and a silent hang risk. Requesting changes.
🔴 CRITICAL: Deadlock when outer goroutine exits early
init.go:101 — a.goroutineWg.Add(4) is called unconditionally before the outer goroutine, but the 4 inner goroutines are only spawned at lines 123-128, after discoverHDWalletKeys, finalizePendingTxs, and refreshDb all succeed.
If discoverHDWalletKeys fails (line 106-110), the outer goroutine returns early. The 4 child goroutines are never launched, so Done() is never called 4 times. Any subsequent call to Stop(), Lock(), or Reset() will call goroutineWg.Wait() and deadlock forever.
// init.go — current code in PR
a.goroutineWg.Add(4) // counter = 4
go func() {
// ...
if _, err := a.discoverHDWalletKeys(ctx); err != nil {
a.syncCh <- err
close(a.syncCh)
return // ← EARLY EXIT: 4 Done()s never happen → DEADLOCK
}
// ... only here are the 4 goroutines spawned
}()Fix: Move Add(4) to just before the 4 go calls (line 122), or add compensating Done() calls on every early-return path. The cleaner approach:
go func() {
a.Explorer().Start()
// ...
if _, err := a.discoverHDWalletKeys(ctx); err != nil {
a.syncCh <- err
close(a.syncCh)
return
}
// ...
a.goroutineWg.Add(4) // ← move here, right before spawning
go a.listenForArkTxs(ctx)
go a.listenForOnchainTxs(ctx)
go a.listenDbEvents(ctx)
go a.periodicRefreshDb(ctx)
}()🔴 HIGH: Shutdown during init phase also deadlocks
Even if discoverHDWalletKeys succeeds, if Lock()/Stop()/Reset() is called while the outer goroutine is still running Explorer().Start(), discoverHDWalletKeys(), or refreshDb() — i.e. before the 4 goroutines are launched — goroutineWg.Wait() will still deadlock because Add(4) already ran but no Done() has been called yet.
This is the same root cause as above: the Add and the go launches are not atomic and can be interrupted by shutdown.
Moving Add(4) to just before the goroutine launches (as shown above) fixes both issues, because if shutdown happens before Add(4), the WaitGroup is still at 0 and Wait() returns immediately.
🟡 MEDIUM: No timeout / safety net on Wait()
If any of the 4 goroutines hangs (e.g. a blocking gRPC stream that doesn't respect context cancellation), goroutineWg.Wait() will block the caller forever. Consider adding a timeout:
done := make(chan struct{})
go func() {
a.goroutineWg.Wait()
close(done)
}()
select {
case <-done:
case <-time.After(10 * time.Second):
log.Warn("timed out waiting for background goroutines to stop")
}🟡 MEDIUM: Missing test coverage
There are no tests for:
Stop()/Lock()whendiscoverHDWalletKeysfails (would catch the deadlock)Stop()called during the init phase (before goroutines are launched)Unlock()→Lock()→Unlock()cycle (WaitGroup reuse correctness)
Given this is fixing a race condition, at minimum a test that calls Stop() shortly after Unlock() (to exercise the early-shutdown path) would be valuable.
✅ Good
defer a.goroutineWg.Done()at the top of each goroutine is correct — covers early returns (e.g.wallet == nilchecks,refreshDbInterval == 0).- All three shutdown paths (
Stop,Lock,Reset) are covered. - The
stopOnceinStop()ensuresWait()isn't called concurrently from multiple callers.
Summary: The approach is sound but the placement of Add(4) creates a guaranteed deadlock on any early-exit path in the outer goroutine. Move Add(4) to immediately before the 4 go calls to fix.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
client.go (1)
1089-1142:⚠️ Potential issue | 🟠 MajorInner publisher goroutines in
listenDbEventsare not tracked bygoroutineWg.The three
go func()publishers spawned at lines 1108, 1121, and 1134 execute outside the goroutine tracking mechanism. WhenStop()orReset()calla.goroutineWg.Wait(), they return while these publishers may still be publishing events, allowing shutdown to complete before event publication finishes. Adda.goroutineWg.Add(1)before each publisher goroutine, or publish inline.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client.go` around lines 1089 - 1142, The publisher goroutines inside listenDbEvents are not tracked by a.goroutineWg, so shutdown via Stop()/Reset() can finish before those goroutines complete; fix by incrementing the wait group before spawning each anonymous publisher (call a.goroutineWg.Add(1) immediately before each go func for the UtxoStore, VtxoStore and TransactionStore event handlers) and ensuring each goroutine defers a.goroutineWg.Done() at its start, or alternatively remove the go and publish inline (calling a.utxoBroadcaster.publish, a.vtxoBroadcaster.publish, a.txBroadcaster.publish synchronously) so the work is accounted for by the existing goroutine lifecycle.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@init.go`:
- Around line 101-104: The stopFn is published before the wait-group counter is
incremented, which reintroduces a shutdown race; move the a.goroutineWg.Add(4)
call so it happens before assigning a.stopFn = cancel (i.e., call
a.goroutineWg.Add(4) immediately after creating bgCtx/cancel and before setting
a.stopFn) so concurrent Lock()/Reset()/Stop() that call Wait() cannot observe a
non-nil stopFn while the wait-group is still zero.
---
Outside diff comments:
In `@client.go`:
- Around line 1089-1142: The publisher goroutines inside listenDbEvents are not
tracked by a.goroutineWg, so shutdown via Stop()/Reset() can finish before those
goroutines complete; fix by incrementing the wait group before spawning each
anonymous publisher (call a.goroutineWg.Add(1) immediately before each go func
for the UtxoStore, VtxoStore and TransactionStore event handlers) and ensuring
each goroutine defers a.goroutineWg.Done() at its start, or alternatively remove
the go and publish inline (calling a.utxoBroadcaster.publish,
a.vtxoBroadcaster.publish, a.txBroadcaster.publish synchronously) so the work is
accounted for by the existing goroutine lifecycle.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
| bgCtx, cancel := context.WithCancel(context.Background()) | ||
| a.stopFn = cancel | ||
|
|
||
| a.goroutineWg.Add(4) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, find and examine the init.go file
fd -t f "init.go" | head -5Repository: arkade-os/go-sdk
Length of output: 68
🏁 Script executed:
# Get the full context around the claimed problem area
rg -n "bgCtx.*WithCancel|a\.stopFn|a\.goroutineWg\.Add\(4\)" -B 2 -A 2Repository: arkade-os/go-sdk
Length of output: 795
🏁 Script executed:
# Find the Lock, Reset, and Stop methods mentioned in the review
ast-grep --pattern 'func ($_) Lock() { $$$ }' | head -20Repository: arkade-os/go-sdk
Length of output: 42
🏁 Script executed:
# Also search for Wait() calls and stopFn checks
rg -n "a\.stopFn.*nil|Wait\(\)" -B 2 -A 2Repository: arkade-os/go-sdk
Length of output: 4467
🏁 Script executed:
# Read init.go lines around the problem area
sed -n '80,160p' init.go | cat -nRepository: arkade-os/go-sdk
Length of output: 1798
🏁 Script executed:
# Also check if there are any goroutines already running before line 101
sed -n '1,100p' init.go | cat -nRepository: arkade-os/go-sdk
Length of output: 3392
🏁 Script executed:
# Find the Lock, Reset, and Stop method implementations
rg -n "func \(.*\) (Lock|Reset|Stop)\(" init.go client.go -A 10Repository: arkade-os/go-sdk
Length of output: 1129
Move Add(4) before publishing stopFn.
a.stopFn becomes visible on line 102 before the wait-group counter is incremented on line 104. A concurrent call to Lock(), Reset(), or Stop() can observe a non-nil stopFn, call Wait() while the counter is still 0, and return immediately—allowing teardown to proceed before the startup goroutines are even launched. This reopens the shutdown race this PR is intended to fix.
Suggested fix
bgCtx, cancel := context.WithCancel(context.Background())
- a.stopFn = cancel
-
a.goroutineWg.Add(4)
+ a.stopFn = cancel
go func() {
a.Explorer().Start()📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| bgCtx, cancel := context.WithCancel(context.Background()) | |
| a.stopFn = cancel | |
| a.goroutineWg.Add(4) | |
| bgCtx, cancel := context.WithCancel(context.Background()) | |
| a.goroutineWg.Add(4) | |
| a.stopFn = cancel |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@init.go` around lines 101 - 104, The stopFn is published before the
wait-group counter is incremented, which reintroduces a shutdown race; move the
a.goroutineWg.Add(4) call so it happens before assigning a.stopFn = cancel
(i.e., call a.goroutineWg.Add(4) immediately after creating bgCtx/cancel and
before setting a.stopFn) so concurrent Lock()/Reset()/Stop() that call Wait()
cannot observe a non-nil stopFn while the wait-group is still zero.
There was a problem hiding this comment.
Arkana Code Review — #158 (re-review)
Commit 5307875 addresses both critical findings from my previous review:
-
🔴 Deadlock on early exit — FIXED.
Add(4)moved to right before the 4 goroutine launches, after therefreshDberror check. Early return no longer leaves the WaitGroup stranded. -
🔴 Shutdown during init — FIXED. Outer goroutine tracked with
Add(1)+defer Done().Wait()now correctly handles shutdown at any phase — during init, during listeners, or after completion. -
Bonus fix:
if err != nil { return }prevents launching listener goroutines onrefreshDbfailure — good defensive addition.
The WaitGroup lifecycle is now correct:
Add(1)before outer goroutine →defer Done()in outer goroutineAdd(4)only reached on success, right before 4gocalls → each defersDone()- All 3 shutdown paths (
Lock,Stop,Reset) callstopFn()thenWait()
My two MEDIUM suggestions (timeout on Wait(), test coverage) remain unaddressed but are not blocking. They can be follow-ups.
LGTM. ✅
There was a problem hiding this comment.
Arkana Code Review — #158 (re-review)
Commit 5307875 addresses both critical findings from my previous review:
-
🔴 Deadlock on early exit — FIXED.
Add(4)moved to right before the 4 goroutine launches, after therefreshDberror check. Early return no longer leaves the WaitGroup stranded. -
🔴 Shutdown during init — FIXED. Outer goroutine tracked with
Add(1)+defer Done().Wait()now correctly handles shutdown at any phase — during init, during listeners, or after completion. -
Bonus fix:
if err != nil { return }prevents launching listener goroutines onrefreshDbfailure — good defensive addition.
The WaitGroup lifecycle is now correct:
Add(1)before outer goroutine →defer Done()in outer goroutineAdd(4)only reached on success, right before 4gocalls → each defersDone()- All 3 shutdown paths (
Lock,Stop,Reset) callstopFn()thenWait()
My two MEDIUM suggestions (timeout on Wait(), test coverage) remain unaddressed but are not blocking. They can be follow-ups.
LGTM. ✅
|
@Dunsin-cyber in #153 we're gonna move these listeners into a contract watcher, and therefore this code is going to subject to breaking changes (don't look at it now as it requires indeed changes). If we don't solve directly there, would be better to postpone these fixes after that pr gets merged. |
alright, got it. thanks @altafan |
fixes #154
Added a
goroutineWg sync.WaitGroupto arkClient to track the 4 background goroutines(listenForArkTxs, listenForOnchainTxs, listenDbEvents, periodicRefreshDb). Add(4)is called before the outer goroutine inUnlock()so the counter is set before any shutdown can race it.Summary by CodeRabbit