
fix: graceful shutdown of bg goroutines before DB close#158

Open
Dunsin-cyber wants to merge 2 commits into arkade-os:master from Dunsin-cyber:fix/DBClose-race-condition

Conversation

@Dunsin-cyber

@Dunsin-cyber Dunsin-cyber commented Apr 29, 2026

fixes #154

Added a goroutineWg sync.WaitGroup to arkClient to track the 4 background goroutines (listenForArkTxs, listenForOnchainTxs, listenDbEvents, periodicRefreshDb). Add(4) is called before the outer goroutine in Unlock() so the counter is set before any shutdown can race it.

Summary by CodeRabbit

  • Bug Fixes
    • Improved shutdown synchronization to ensure background processes properly terminate before lock/unlock transitions complete, enhancing reliability and stability.

@coderabbitai

coderabbitai Bot commented Apr 29, 2026

Warning

Rate limit exceeded

@Dunsin-cyber has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 34 minutes and 39 seconds before requesting another review.

To keep reviews running without waiting, you can enable the usage-based add-on for your organization. This allows additional reviews beyond the hourly cap; account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ba6c1107-efa3-4510-86e1-931db4f355cd

📥 Commits

Reviewing files that changed from the base of the PR and between 1106180 and 5307875.

📒 Files selected for processing (1)
  • init.go

Walkthrough

The PR introduces a sync.WaitGroup to the arkClient struct to coordinate graceful shutdown of background goroutines before database closure. Four long-running goroutines defer Done() calls to signal completion. During lock, the client signals goroutines to stop and waits for completion before proceeding.

Changes

Cohort / File(s) Summary
Client shutdown synchronization
client.go, init.go
Adds a sync.WaitGroup to arkClient and updates initialization/shutdown logic. The goroutines (listenForArkTxs, listenForOnchainTxs, listenDbEvents, periodicRefreshDb) now defer Done() calls. Unlock calls Add(4) before spawning the background tasks. Lock blocks on goroutineWg.Wait() after invoking stopFn() to ensure all goroutines complete before database closure.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check: ✅ Passed. The title clearly describes the main change: implementing graceful shutdown of background goroutines before database closure, which directly addresses the race condition issue.
  • Linked Issues Check: ✅ Passed. Changes implement graceful shutdown via WaitGroup to coordinate goroutine completion before DB close, directly addressing issue #154's requirement.
  • Out of Scope Changes Check: ✅ Passed. All changes are focused on adding WaitGroup synchronization to background goroutines, directly related to the linked issue objective.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

Review rate limit: 0/1 reviews remaining, refill in 34 minutes and 39 seconds.

Comment @coderabbitai help to get the list of available commands and usage tips.


@arkanaai arkanaai Bot left a comment


Arkana Code Review — #158

The intent is correct: wait for background goroutines to drain before closing the DB. However, the current implementation has a critical deadlock bug and a silent hang risk. Requesting changes.


🔴 CRITICAL: Deadlock when outer goroutine exits early

init.go:101: a.goroutineWg.Add(4) is called unconditionally before the outer goroutine, but the 4 inner goroutines are only spawned at lines 123-128, after discoverHDWalletKeys, finalizePendingTxs, and refreshDb all succeed.

If discoverHDWalletKeys fails (lines 106-110), the outer goroutine returns early. The 4 child goroutines are never launched, so Done() is never called 4 times. Any subsequent call to Stop(), Lock(), or Reset() will call goroutineWg.Wait() and deadlock forever.

// init.go — current code in PR
a.goroutineWg.Add(4)       // counter = 4
go func() {
    // ...
    if _, err := a.discoverHDWalletKeys(ctx); err != nil {
        a.syncCh <- err
        close(a.syncCh)
        return              // ← EARLY EXIT: 4 Done()s never happen → DEADLOCK
    }
    // ... only here are the 4 goroutines spawned
}()

Fix: Move Add(4) to just before the 4 go calls (line 122), or add compensating Done() calls on every early-return path. The cleaner approach:

go func() {
    a.Explorer().Start()
    // ...
    if _, err := a.discoverHDWalletKeys(ctx); err != nil {
        a.syncCh <- err
        close(a.syncCh)
        return
    }
    // ...
    a.goroutineWg.Add(4)  // ← move here, right before spawning
    go a.listenForArkTxs(ctx)
    go a.listenForOnchainTxs(ctx)
    go a.listenDbEvents(ctx)
    go a.periodicRefreshDb(ctx)
}()

🔴 HIGH: Shutdown during init phase also deadlocks

Even if discoverHDWalletKeys succeeds, if Lock()/Stop()/Reset() is called while the outer goroutine is still running Explorer().Start(), discoverHDWalletKeys(), or refreshDb() — i.e. before the 4 goroutines are launched — goroutineWg.Wait() will still deadlock because Add(4) already ran but no Done() has been called yet.

This is the same root cause as above: the Add and the go launches are not atomic and can be interrupted by shutdown.

Moving Add(4) to just before the goroutine launches (as shown above) fixes both issues, because if shutdown happens before Add(4), the WaitGroup is still at 0 and Wait() returns immediately.

🟡 MEDIUM: No timeout / safety net on Wait()

If any of the 4 goroutines hangs (e.g. a blocking gRPC stream that doesn't respect context cancellation), goroutineWg.Wait() will block the caller forever. Consider adding a timeout:

done := make(chan struct{})
go func() {
    a.goroutineWg.Wait()
    close(done)
}()
select {
case <-done:
case <-time.After(10 * time.Second):
    log.Warn("timed out waiting for background goroutines to stop")
}

🟡 MEDIUM: Missing test coverage

There are no tests for:

  1. Stop() / Lock() when discoverHDWalletKeys fails (would catch the deadlock)
  2. Stop() called during the init phase (before goroutines are launched)
  3. Unlock() → Lock() → Unlock() cycle (WaitGroup reuse correctness)

Given this is fixing a race condition, at minimum a test that calls Stop() shortly after Unlock() (to exercise the early-shutdown path) would be valuable.

✅ Good

  • defer a.goroutineWg.Done() at the top of each goroutine is correct — covers early returns (e.g. wallet == nil checks, refreshDbInterval == 0).
  • All three shutdown paths (Stop, Lock, Reset) are covered.
  • The stopOnce in Stop() ensures Wait() isn't called concurrently from multiple callers.

Summary: The approach is sound but the placement of Add(4) creates a guaranteed deadlock on any early-exit path in the outer goroutine. Move Add(4) to immediately before the 4 go calls to fix.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
client.go (1)

1089-1142: ⚠️ Potential issue | 🟠 Major

Inner publisher goroutines in listenDbEvents are not tracked by goroutineWg.

The three go func() publishers spawned at lines 1108, 1121, and 1134 execute outside the goroutine tracking mechanism. When Stop() or Reset() call a.goroutineWg.Wait(), they return while these publishers may still be publishing events, allowing shutdown to complete before event publication finishes. Add a.goroutineWg.Add(1) before each publisher goroutine, or publish inline.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client.go` around lines 1089 - 1142, The publisher goroutines inside
listenDbEvents are not tracked by a.goroutineWg, so shutdown via Stop()/Reset()
can finish before those goroutines complete; fix by incrementing the wait group
before spawning each anonymous publisher (call a.goroutineWg.Add(1) immediately
before each go func for the UtxoStore, VtxoStore and TransactionStore event
handlers) and ensuring each goroutine defers a.goroutineWg.Done() at its start,
or alternatively remove the go and publish inline (calling
a.utxoBroadcaster.publish, a.vtxoBroadcaster.publish, a.txBroadcaster.publish
synchronously) so the work is accounted for by the existing goroutine lifecycle.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@init.go`:
- Around line 101-104: The stopFn is published before the wait-group counter is
incremented, which reintroduces a shutdown race; move the a.goroutineWg.Add(4)
call so it happens before assigning a.stopFn = cancel (i.e., call
a.goroutineWg.Add(4) immediately after creating bgCtx/cancel and before setting
a.stopFn) so concurrent Lock()/Reset()/Stop() that call Wait() cannot observe a
non-nil stopFn while the wait-group is still zero.

---

Outside diff comments:
In `@client.go`:
- Around line 1089-1142: The publisher goroutines inside listenDbEvents are not
tracked by a.goroutineWg, so shutdown via Stop()/Reset() can finish before those
goroutines complete; fix by incrementing the wait group before spawning each
anonymous publisher (call a.goroutineWg.Add(1) immediately before each go func
for the UtxoStore, VtxoStore and TransactionStore event handlers) and ensuring
each goroutine defers a.goroutineWg.Done() at its start, or alternatively remove
the go and publish inline (calling a.utxoBroadcaster.publish,
a.vtxoBroadcaster.publish, a.txBroadcaster.publish synchronously) so the work is
accounted for by the existing goroutine lifecycle.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 511ee76b-55d5-45ea-a939-2daef0da77e4

📥 Commits

Reviewing files that changed from the base of the PR and between 2d035b6 and 1106180.

📒 Files selected for processing (2)
  • client.go
  • init.go

Comment thread init.go Outdated
Comment on lines +101 to +104
bgCtx, cancel := context.WithCancel(context.Background())
a.stopFn = cancel

a.goroutineWg.Add(4)


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, find and examine the init.go file
fd -t f "init.go" | head -5

Repository: arkade-os/go-sdk

Length of output: 68


🏁 Script executed:

# Get the full context around the claimed problem area
rg -n "bgCtx.*WithCancel|a\.stopFn|a\.goroutineWg\.Add\(4\)" -B 2 -A 2

Repository: arkade-os/go-sdk

Length of output: 795


🏁 Script executed:

# Find the Lock, Reset, and Stop methods mentioned in the review
ast-grep --pattern 'func ($_) Lock() { $$$ }' | head -20

Repository: arkade-os/go-sdk

Length of output: 42


🏁 Script executed:

# Also search for Wait() calls and stopFn checks
rg -n "a\.stopFn.*nil|Wait\(\)" -B 2 -A 2

Repository: arkade-os/go-sdk

Length of output: 4467


🏁 Script executed:

# Read init.go lines around the problem area
sed -n '80,160p' init.go | cat -n

Repository: arkade-os/go-sdk

Length of output: 1798


🏁 Script executed:

# Also check if there are any goroutines already running before line 101
sed -n '1,100p' init.go | cat -n

Repository: arkade-os/go-sdk

Length of output: 3392


🏁 Script executed:

# Find the Lock, Reset, and Stop method implementations
rg -n "func \(.*\) (Lock|Reset|Stop)\(" init.go client.go -A 10

Repository: arkade-os/go-sdk

Length of output: 1129


Move Add(4) before publishing stopFn.

a.stopFn becomes visible on line 102 before the wait-group counter is incremented on line 104. A concurrent call to Lock(), Reset(), or Stop() can observe a non-nil stopFn, call Wait() while the counter is still 0, and return immediately—allowing teardown to proceed before the startup goroutines are even launched. This reopens the shutdown race this PR is intended to fix.

Suggested fix
 	bgCtx, cancel := context.WithCancel(context.Background())
-	a.stopFn = cancel
-
 	a.goroutineWg.Add(4)
+	a.stopFn = cancel
 	go func() {
 		a.Explorer().Start()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:
    bgCtx, cancel := context.WithCancel(context.Background())
    a.stopFn = cancel
    a.goroutineWg.Add(4)

After:
    bgCtx, cancel := context.WithCancel(context.Background())
    a.goroutineWg.Add(4)
    a.stopFn = cancel
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@init.go` around lines 101 - 104, The stopFn is published before the
wait-group counter is incremented, which reintroduces a shutdown race; move the
a.goroutineWg.Add(4) call so it happens before assigning a.stopFn = cancel
(i.e., call a.goroutineWg.Add(4) immediately after creating bgCtx/cancel and
before setting a.stopFn) so concurrent Lock()/Reset()/Stop() that call Wait()
cannot observe a non-nil stopFn while the wait-group is still zero.


@arkanaai arkanaai Bot left a comment


Arkana Code Review — #158 (re-review)

Commit 5307875 addresses both critical findings from my previous review:

  1. 🔴 Deadlock on early exit — FIXED. Add(4) moved to right before the 4 goroutine launches, after the refreshDb error check. Early return no longer leaves the WaitGroup stranded.

  2. 🔴 Shutdown during init — FIXED. Outer goroutine tracked with Add(1) + defer Done(). Wait() now correctly handles shutdown at any phase — during init, during listeners, or after completion.

  3. Bonus fix: if err != nil { return } prevents launching listener goroutines on refreshDb failure — good defensive addition.

The WaitGroup lifecycle is now correct:

  • Add(1) before outer goroutine → defer Done() in outer goroutine
  • Add(4) only reached on success, right before 4 go calls → each defers Done()
  • All 3 shutdown paths (Lock, Stop, Reset) call stopFn() then Wait()

My two MEDIUM suggestions (timeout on Wait(), test coverage) remain unaddressed but are not blocking. They can be follow-ups.

LGTM. ✅



@Kukks Kukks requested a review from altafan April 29, 2026 09:47
@altafan
Contributor

altafan commented Apr 30, 2026

@Dunsin-cyber in #153 we're going to move these listeners into a contract watcher, so this code will be subject to breaking changes (don't look at it now as it indeed requires changes). If we don't solve this directly there, it would be better to postpone these fixes until after that PR gets merged.

@Dunsin-cyber
Author

@Dunsin-cyber in #153 we're going to move these listeners into a contract watcher, so this code will be subject to breaking changes (don't look at it now as it indeed requires changes). If we don't solve this directly there, it would be better to postpone these fixes until after that PR gets merged.

alright, got it. thanks @altafan



Development

Successfully merging this pull request may close these issues.

DB close race: goroutines still running cause sql: database is closed errors

2 participants