go/worker/storage/committee: Fix teardown #6444
base: master
Conversation
Previously the fetch pool was closed first, which caused doneCh to never be closed, which in turn caused wg.Wait to never finish. Probably a better approach is to fix workerpool.Pool.
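Roughly, the teardown ordering at issue, as a self-contained sketch; the Pool here is a stand-in, not the real workerpool.Pool, and the names are illustrative:

```go
package main

import (
	"context"
	"sync"
)

// Stand-in for workerpool.Pool: Submit returns a channel that is closed
// once the job completes; Stop waits for in-flight jobs.
type Pool struct{ wg sync.WaitGroup }

func (p *Pool) Submit(job func()) <-chan struct{} {
	done := make(chan struct{})
	p.wg.Add(1)
	go func() {
		defer p.wg.Done()
		job()
		close(done)
	}()
	return done
}

func (p *Pool) Stop() { p.wg.Wait() }

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	fetchPool := &Pool{}

	var wg sync.WaitGroup
	doneCh := fetchPool.Submit(func() { <-ctx.Done() }) // job returns on cancel
	wg.Add(1)
	go func() {
		defer wg.Done()
		<-doneCh // released only once the job's doneCh is closed
	}()

	// Teardown order matters: cancel first so jobs return and doneCh is
	// closed, then wait for the waiters, and only then stop the pool.
	// Per the description above, stopping the real pool first could leave
	// doneCh unclosed, hanging wg.Wait forever.
	cancel()
	wg.Wait()
	fetchPool.Stop()
}
```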
```go
// once the task is complete or the pool is stopped.
func (p *Pool) Submit(job func()) <-chan struct{} {
```
I believe the correct way (arguably over-engineering) would be:

```go
func (p *Pool) Submit(jobCtx context.Context, job func(ctx context.Context)) <-chan struct{} {
```
The first parameter would be the job context. The pool would also have an internal context. The job should be canceled either if the job context is canceled or if the pool context is canceled. The pool context would be canceled by pool.Stop (which should possibly block until all jobs have finished, or there could be a separate method for that).
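A rough, self-contained sketch of what I have in mind (illustrative only, not the real workerpool API; the context merge uses context.AfterFunc from Go 1.21+):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Illustrative pool with an internal context; not the real workerpool.Pool.
type Pool struct {
	ctx    context.Context
	cancel context.CancelFunc
}

func NewPool() *Pool {
	ctx, cancel := context.WithCancel(context.Background())
	return &Pool{ctx: ctx, cancel: cancel}
}

// Submit runs job with a context that is canceled when either jobCtx or the
// pool's internal context is canceled; the returned channel is closed once
// the job returns.
func (p *Pool) Submit(jobCtx context.Context, job func(ctx context.Context)) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		ctx, cancel := context.WithCancel(jobCtx)
		defer cancel()
		stop := context.AfterFunc(p.ctx, cancel) // propagate pool cancellation
		defer stop()
		job(ctx)
	}()
	return done
}

// Stop cancels the pool context, canceling all running jobs.
func (p *Pool) Stop() { p.cancel() }

func main() {
	p := NewPool()
	done := p.Submit(context.Background(), func(ctx context.Context) {
		select {
		case <-ctx.Done():
			fmt.Println("job canceled")
		case <-time.After(time.Second):
			fmt.Println("job finished")
		}
	})
	p.Stop() // pool cancellation reaches the job via its ctx
	<-done
}
```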
The current pattern, e.g.:

```go
doneCh := fetchPool.Submit(func() {
	w.fetchDiff(ctx, this.Round, prevRoots[i], this.Roots[i])
})
wg.Go(func() {
	<-doneCh
})
```

I find this non-idiomatic and error-prone.
Regardless, I would stick to a simpler solution, either commit 1 or 2:
In general, goroutines are cheap, so worker pools shouldn't be typical in Go. You can use the counting-semaphore pattern and/or buffered channels instead to limit concurrency. The worker pool also internally uses eapache/channels, which is deprecated, and it suffers from the unbounded-submit issue as well. Given that we only use it in two places, I am not sure this package is actually needed long-term, hence the preference for the simpler solutions.
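A minimal sketch of what I mean, using a buffered channel as a counting semaphore (maxWorkers and the job body are illustrative):

```go
package main

import "sync"

func main() {
	const maxWorkers = 4
	sem := make(chan struct{}, maxWorkers) // counting semaphore

	var wg sync.WaitGroup
	for i := 0; i < 16; i++ {
		wg.Add(1)
		sem <- struct{}{} // blocks while maxWorkers jobs are in flight
		go func() {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			// ... do the work, e.g. the fetchDiff call
		}()
	}
	wg.Wait() // no pool to tear down afterwards
}
```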
This could remove the need for the previous commit.
Force-pushed from c5b803a to 1eeb529.
```go
w.status = api.StatusStarting
w.statusLock.Unlock()

var fetchPool *workerpool.Pool
```
These changes don't change anything, or am I missing something? And why is this change better? You just introduced the possibility that the pool can be nil, which was not possible before.
```go
for item := range p.jobCh.Out() { //nolint:revive
	job := item.(*jobDescriptor)
	if job.completeCh != nil {
		close(job.completeCh)
```
Now you are using completeCh for two things, job completion and job cancellation, and users cannot distinguish between them.
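For illustration only (not the current workerpool API), one way to make the two cases distinguishable would be to send an outcome on the channel before closing it:

```go
package main

import "fmt"

// Hypothetical outcome type so callers can tell completion from cancellation.
type JobOutcome int

const (
	JobCompleted JobOutcome = iota
	JobCanceled
)

func main() {
	completeCh := make(chan JobOutcome, 1)

	// Pool side: report why the channel is being released.
	poolStopped := true // e.g. Stop ran before the job was picked up
	if poolStopped {
		completeCh <- JobCanceled
	} else {
		completeCh <- JobCompleted
	}
	close(completeCh)

	// Caller side: the outcome is now observable.
	switch <-completeCh {
	case JobCompleted:
		fmt.Println("job ran to completion")
	case JobCanceled:
		fmt.Println("job was canceled")
	}
}
```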
https://github.com/oasisprotocol/internal-ops/issues/1317#issuecomment-3765815368 showed that, in the case of corrupted storage, our worker teardown might get stuck.
How to test
Start your node with a paratime configured and return a dummy error here.
Prior to this change, the indexer would continue, whilst the storage worker would get stuck at teardown.
It would be nice to have a test for this, but we would need to completely refactor the storage worker first. Mainly, the state DB, p2p, and other dependencies should be passed as parameters so that errors can be mocked in the "integration" tests.