Skip to content

fix(client): race conditions in AutoTLS initialization#80

Draft
lidel wants to merge 3 commits intomainfrom
fix/autotls-init-race
Draft

fix(client): race conditions in AutoTLS initialization#80
lidel wants to merge 3 commits intomainfrom
fix/autotls-init-race

Conversation

@lidel
Copy link
Contributor

@lidel lidel commented Nov 28, 2025

Warning

Do not merge, not ready for review, still investigating.

Two race conditions could cause forge (/dns../ws/.libp2p.direct) addresses to not appear:

  1. hasHost() returning false when address factory is called

    • provideHost now calls hostFn() immediately, ensuring hasHost() returns true right after ProvideHost() is called
    • prevents race where address factory runs before Start() goroutine
  2. hasCert not being set when cert already exists

    • cached_managed_cert event only fires when cert is NEW to cache
    • if cert was already in certmagic's in-memory cache, no event fires
    • now explicitly set hasCert=true after ManageAsync() when we know cert existed in storage beforehand

TODO

  • fix race 1
  • fix race 2
  • tests
  • manual with go-libp2p/examples/autotls (with and without cached cert)
    • ⚠ ️this does not seem to be enough to fix issue, deeper refactor is necessary due to how libp2p adds/removes adds based on autonat
  • manual e2e with kubo

Two race conditions could cause forge addresses to not appear:

1. hasHost() returning false when address factory is called
   - provideHost now calls hostFn() immediately, ensuring hasHost()
     returns true right after ProvideHost() is called
   - prevents race where address factory runs before Start() goroutine

2. hasCert not being set when cert already exists
   - cached_managed_cert event only fires when cert is NEW to cache
   - if cert was already in certmagic's in-memory cache, no event fires
   - now explicitly set hasCert=true after ManageAsync() when we know
     cert existed in storage beforehand
@codecov
Copy link

codecov bot commented Nov 28, 2025

Codecov Report

❌ Patch coverage is 36.90476% with 53 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.28%. Comparing base (a6a2269) to head (e382fa6).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
client/acme.go 36.90% 48 Missing and 5 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #80      +/-   ##
==========================================
+ Coverage   64.97%   68.28%   +3.31%     
==========================================
  Files          12       21       +9     
  Lines        1062     1668     +606     
==========================================
+ Hits          690     1139     +449     
- Misses        292      410     +118     
- Partials       80      119      +39     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

- TestHasHostTrueImmediatelyAfterProvideHost: verify hostFn is resolved
  immediately after ProvideHost, no deadlock when AddressFactory called
- TestAddressFactoryOrderIndependence: verify factory works before/after
  ProvideHost without deadlocking
- TestAddressFactoryConcurrentAccess: verify no data races with concurrent
  AddressFactory calls (run with -race flag)
- TestProvideHostIdempotent: verify second ProvideHost doesn't deadlock
  and first host wins via sync.OnceValue
The previous fix (5be6b87) addressed race conditions but had gaps:

1. onCertLoaded callback could fire before reachability was confirmed
   - OnEvent handler called callback directly, racing with Start()
   - now OnEvent only signals via channel, Start() orchestrates timing

2. cert renewal wasn't detected as needing reachability check
   - only checked if cert exists, not if it needs renewal
   - certs in renewal window should wait for AutoNAT like new certs
   - added localCertNeedsRenewal() to detect this case

3. cached cert path blocked unnecessarily on reachability
   - cert already exists, no need to verify reachability again
   - callback now fires immediately after loading cert from cache
   - address factory's ConfirmedAddrs() dynamically filters unreachable addrs

4. simplified to AutoNAT v2 only (EvtHostReachableAddrsChanged)
   - v2 provides per-address reachability which is what p2p-forge needs
   - v2 is enabled by default in modern go-libp2p

5. cert check failures were silent
   - converted to methods to use instance logger (respects WithLogger)
   - added ERROR logs for unexpected failures to aid debugging
@lidel lidel force-pushed the fix/autotls-init-race branch from 8b36ea4 to e382fa6 Compare December 5, 2025 23:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant