Skip to content

eth/protocols/wit: remove peer lock held across p2p.Send to prevent broadcast stalls and added metrics#2120

Merged
pratikspatil024 merged 4 commits intodevelopfrom
psps-wit-improvements
Mar 11, 2026
Merged

eth/protocols/wit: remove peer lock held across p2p.Send to prevent broadcast stalls and added metrics#2120
pratikspatil024 merged 4 commits intodevelopfrom
psps-wit-improvements

Conversation

@pratikspatil024
Copy link
Member

@pratikspatil024 pratikspatil024 commented Mar 9, 2026

Description

  • Remove p.lock from the WIT protocol peer, it was held across synchronous p2p.Send() calls (up to 20s write timeout), which could block the block broadcast loop when a peer was slow or disconnecting
  • Make KnownCache thread-safe internally so external locking is no longer needed
  • Add metrics for observability: wit/peer/send_witness_duration and eth/broadcast_loop_duration

Changes

  • Bugfix (non-breaking change that solves an issue)
  • Hotfix (change that solves an urgent issue, and requires immediate attention)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (change that is not backwards-compatible and/or changes current functionality)
  • Changes only for a subset of nodes

Breaking changes

Please complete this section if any breaking changes have been made, otherwise delete it

Nodes audience

In case this PR includes changes that must be applied only to a subset of nodes, please specify how you handled it (e.g. by adding a flag with a default value...)

Checklist

  • I have added at least 2 reviewer or the whole pos-v1 team
  • I have added sufficient documentation in code
  • I will be resolving comments - if any - by pushing each fix in a separate commit and linking the commit hash in the comment reply
  • Created a task in Jira and informed the team for implementation in Erigon client (if applicable)
  • Includes RPC methods changes, and the Notion documentation has been updated

Cross repository changes

  • This PR requires changes to heimdall
    • In case link the PR here:
  • This PR requires changes to matic-cli
    • In case link the PR here:

Testing

  • I have added unit tests
  • I have added tests to CI
  • I have tested this code manually on local environment
  • I have tested this code manually on remote devnet using express-cli
  • I have tested this code manually on amoy
  • I have created new e2e tests into express-cli

Manual tests

Please complete this section with the steps you performed if you ran manual tests for this functionality, otherwise delete it

Additional comments

Please post additional comments in this section if you have them, otherwise delete it

@pratikspatil024 pratikspatil024 requested review from a team, cffls and lucca30 March 9, 2026 08:07
@codecov
Copy link

codecov bot commented Mar 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 51.46%. Comparing base (a328aa9) to head (1ceb09b).
⚠️ Report is 6 commits behind head on develop.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #2120      +/-   ##
===========================================
+ Coverage    51.44%   51.46%   +0.02%     
===========================================
  Files          881      882       +1     
  Lines       153870   153975     +105     
===========================================
+ Hits         79157    79247      +90     
- Misses       69542    69546       +4     
- Partials      5171     5182      +11     
Files with missing lines Coverage Δ
eth/handler.go 63.58% <100.00%> (+0.10%) ⬆️
eth/protocols/wit/peer.go 54.54% <100.00%> (+7.30%) ⬆️

... and 24 files with indirect coverage changes

Files with missing lines Coverage Δ
eth/handler.go 63.58% <100.00%> (+0.10%) ⬆️
eth/protocols/wit/peer.go 54.54% <100.00%> (+7.30%) ⬆️

... and 24 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@kamuikatsurgi
Copy link
Member

Looks like erigon got stuck at block 256 in the E2E test. Could you take a look and see if these changes did not affect anything?

@adamdossa adamdossa force-pushed the psps-wit-improvements branch from 36c9229 to c5a59a8 Compare March 9, 2026 11:29
@adamdossa
Copy link
Contributor

@claude Can you do a detailed review of this PR.

@claude
Copy link

claude bot commented Mar 9, 2026

Claude encountered an error —— View job


I'll analyze this and get back to you.

@claude
Copy link

claude bot commented Mar 10, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

@jhkimqd
Copy link
Contributor

jhkimqd commented Mar 11, 2026

Quick question @pratikspatil024 - I could be missing something, but while testing this on Kurtosis-PoS, it seems like the wit/peer/send_witness_duration metric is not incrementing - I have a mixture of Erigon, Archive nodes, RPCs snycing/producing witnesses, etc.. Would we need to enable some specific metric to see this count increase?

For the functionality, I've tested on separate Kurtosis devnets on Bor images with and without this fix (along with a custom timeout added to simulate slow p2p.Send()), and I was seeing different behaviors - span rotations on the previous images, and normal network progression with the fix which seems to be expected!

@pratikspatil024
Copy link
Member Author

Awesome, thanks for looking into this and testing this fix, @jhkimqd!

Looking at the wit/peer/send_witness_duration metric not increasing issue, I back tracked this part of the code where the metric is set:

		case witness := <-p.queuedWitness:
			start := time.Now()
			if err := p.sendNewWitness(witness); err != nil {
				return
			}
			witnessSendTimer.Update(time.Since(start))
			p.logger.Debug("propagated witness", "hash", witness.Header().Hash())

and found out that the entry point which pushed the witnesses into the queuedWitness channel is never used!!

// TODO(@pratikspatil024) - use this to broadcast the witness
// BroadcastWitness broadcasts the witness to all peers who are not aware of it
func (h *handler) BroadcastWitness(witness *stateless.Witness) {
	// broadcast the witness to all peers who are not aware of it
	for _, peer := range h.peers.peersWithoutWitness(witness.Headers[0].Hash()) {
		peer.Peer.AsyncSendNewWitness(witness)
	}
}

So the metric not increasing makes sense. The delay in p2p.Send() would not be from this BroadcastWitness but from responding to a witness fetching request.

@sonarqubecloud
Copy link

@pratikspatil024 pratikspatil024 merged commit 89ecf6e into develop Mar 11, 2026
27 of 29 checks passed
@pratikspatil024 pratikspatil024 deleted the psps-wit-improvements branch March 11, 2026 08:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants