docs: update devnet-runner skill with information on restarting nodes #209
Merged
122 changes: 122 additions & 0 deletions
.claude/skills/devnet-runner/references/checkpoint-sync.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,122 @@ | ||
| # Checkpoint Sync in Devnets | ||
|
|
||
| This guide covers restarting a node with checkpoint sync instead of replaying from genesis. It is useful for testing checkpoint sync itself, upgrading a node's image mid-devnet, or recovering a crashed node. | ||
|
|
||
| ## When to Use | ||
|
|
||
| - Testing checkpoint sync behavior (interop, verification, catch-up) | ||
| - Replacing a node's Docker image mid-run (e.g., testing a new build) | ||
| - Recovering a node that fell behind or crashed | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| - A running devnet with at least one healthy node to serve the checkpoint state | ||
| - The checkpoint source node's RPC must be reachable (same port as `--metrics-port`) | ||
|
|
||
| ## Key Concepts | ||
|
|
||
| **RPC and metrics share the same port.** ethlambda serves both Prometheus metrics (`/metrics`) and the Lean API (`/lean/v0/...`) on the `--metrics-port`. There is no separate RPC port. | ||
|
|
||
| **Checkpoint sync URL format:** | ||
| ``` | ||
| http://<host>:<metrics-port>/lean/v0/states/finalized | ||
| ``` | ||
|
|
||
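As a rough illustration of the URL format above, the following sketch assembles a checkpoint sync URL from a host and a metrics port. The helper name `checkpoint_url` and the port `8081` are hypothetical and used only for this example; only the `/lean/v0/states/finalized` path comes from the text.

```bash
# Sketch: build a checkpoint sync URL from a host and a metrics port.
# `checkpoint_url` is a hypothetical helper, not part of lean-quickstart.
checkpoint_url() {
  local host="$1" port="$2"
  printf 'http://%s:%s/lean/v0/states/finalized\n' "$host" "$port"
}

# Example with a hypothetical metrics port; substitute the source node's metricsPort.
checkpoint_url 127.0.0.1 8081
```

The result can be passed as the value of `--checkpoint-sync-url`.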
| **The node must have the same genesis config.** Checkpoint sync verifies the downloaded state against the local genesis config (genesis time, validator pubkeys, validator count). The `--custom-network-config-dir` must point to the same genesis used by the rest of the devnet. | ||
|
|
||
| ## Restart Procedure | ||
|
|
||
| **Restart nodes one at a time.** Wait for each node to fully sync and rejoin consensus before restarting the next. 3SF-mini requires 2/3+ of validators to vote in order to justify checkpoints and advance finalization. If 1/3 or more validators are offline simultaneously, finalization stalls until enough nodes come back online. | ||
|
|
||
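To make the one-third rule concrete, here is a small arithmetic sketch. It assumes every validator carries equal weight and takes the rule literally as stated above (finalization stalls once 1/3 or more validators are offline). The helper name `max_offline` is hypothetical.

```bash
# Sketch: largest number of validators that can be offline at once while
# keeping strictly fewer than 1/3 of all validators offline.
# `max_offline` is a hypothetical helper; equal validator weight is assumed.
max_offline() {
  local total="$1"
  # offline * 3 < total  <=>  offline <= (total - 1) / 3 (integer division)
  echo $(( (total - 1) / 3 ))
}

for n in 3 4 7; do
  echo "validators=$n max_offline=$(max_offline "$n")"
done
```

Under these assumptions a 3-validator devnet cannot lose any validator without stalling finalization, while a 4-validator devnet tolerates one restart at a time.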
| ### Step 1: Choose the node to restart | ||
|
|
||
| Any node can be restarted, but be aware that restarting the aggregator node will stop finalization and attestation inclusion in blocks until it catches back up to head. Check which node is the aggregator in `validator-config.yaml`: | ||
| ```yaml | ||
| # In lean-quickstart/<network-dir>/genesis/validator-config.yaml | ||
| validators: | ||
| - name: "ethlambda_0" | ||
| isAggregator: false | ||
| - name: "ethlambda_2" | ||
| isAggregator: true # restarting this stops finalization until it catches up | ||
| ``` | ||
|
|
||
| ### Step 2: Identify a checkpoint source | ||
|
|
||
| Pick any other running node's metrics port as the checkpoint source. The port is configured as `metricsPort` in `validator-config.yaml`. | ||
|
|
||
| For local devnets (host networking), the URL is: | ||
| ``` | ||
| http://127.0.0.1:<metrics-port>/lean/v0/states/finalized | ||
| ``` | ||
|
|
||
| Verify the endpoint is reachable: | ||
| ```bash | ||
| curl -s http://127.0.0.1:<metrics-port>/lean/v0/health | ||
| # Should return: {"status":"healthy","service":"lean-spec-api"} | ||
| ``` | ||
|
|
||
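For scripted checks, the health response can be tested rather than read by eye. The sketch below assumes the body is compact JSON exactly as in the expected output above (no whitespace around the colon). The helper name `is_healthy` and the port in the usage comment are hypothetical.

```bash
# Sketch: decide whether a health response body reports a healthy node.
# `is_healthy` is a hypothetical helper; it reads the response body from stdin
# and looks for the "status":"healthy" pair, assuming compact JSON.
is_healthy() {
  grep -q '"status":"healthy"'
}

# Usage (hypothetical port 8081):
#   curl -s http://127.0.0.1:8081/lean/v0/health | is_healthy && echo "source ready"
```

A plain `grep` keeps the check dependency-free. A JSON-aware tool would be more robust if the response formatting ever changes.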
| ### Step 3: Update the Docker image tag (if changing versions) | ||
|
|
||
| Edit `lean-quickstart/client-cmds/<client>-cmd.sh` and change the image tag in `node_docker` before restarting: | ||
| ```bash | ||
| # In lean-quickstart/client-cmds/ethlambda-cmd.sh, change: | ||
| node_docker="ghcr.io/lambdaclass/ethlambda:local \ | ||
| # To: | ||
| node_docker="ghcr.io/lambdaclass/ethlambda:devnet3 \ | ||
| ``` | ||
|
|
||
| ### Step 4: Pull the new Docker image | ||
|
|
||
| **Pull the image before restarting** to minimize how long the node is absent from the network. If you skip this, `spin-node.sh` will pull during restart, adding minutes of downtime where the node misses proposer slots and attestation duties: | ||
| ```bash | ||
| docker pull <image>:<new_tag> | ||
| ``` | ||
|
|
||
| ### Step 5: Restart with checkpoint sync | ||
|
|
||
| ```bash | ||
| cd lean-quickstart && NETWORK_DIR=local-devnet ./spin-node.sh \ | ||
| --restart-client <node_name> \ | ||
| --checkpoint-sync-url http://127.0.0.1:<source_metrics_port>/lean/v0/states/finalized | ||
| ``` | ||
|
|
||
| This automatically: | ||
| 1. Stops the existing container | ||
| 2. Clears the data directory | ||
| 3. Pulls the Docker image (skipped if already present locally) | ||
| 4. Restarts with `--checkpoint-sync-url` passed to the node | ||
|
|
||
| If `--checkpoint-sync-url` is omitted, it defaults to `https://leanpoint.leanroadmap.org/lean/v0/states/finalized` (the public checkpoint provider). | ||
|
|
||
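| Multiple nodes can be restarted in a single invocation with comma-separated names. Keep the number of validators restarted together below one third of the total so that finalization does not stall: | ||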
| ```bash | ||
| --restart-client ethlambda_0,ethlambda_3 | ||
| ``` | ||
|
|
||
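When several nodes need a restart, an alternative that follows the one-at-a-time advice is to issue one invocation per node. The sketch below is a dry run that only prints the commands and does not execute them. The helper name `print_rolling_restart` and the port `8081` are hypothetical, and waiting for each node to sync between invocations is left to the operator.

```bash
# Sketch: print (dry run) one spin-node.sh invocation per node so that nodes
# can be restarted one at a time. `print_rolling_restart` is a hypothetical helper.
print_rolling_restart() {
  local url="$1"
  shift
  local node
  for node in "$@"; do
    echo "NETWORK_DIR=local-devnet ./spin-node.sh --restart-client $node --checkpoint-sync-url $url"
  done
}

# Hypothetical source port; run each printed command only after the previous node has synced.
print_rolling_restart http://127.0.0.1:8081/lean/v0/states/finalized ethlambda_0 ethlambda_3
```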
| ### Step 6: Verify the node synced | ||
|
|
||
| ```bash | ||
| docker logs --tail 20 <node_name> | ||
| ``` | ||
|
|
||
| Look for: | ||
| - "Block imported successfully" messages catching up to the current slot | ||
| - "Fork Choice Tree" showing finalized/justified/head slots close to the network's current state | ||
| - No error messages about verification failures or SSZ decode errors | ||
|
|
||
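The log check above can be condensed into a quick summary. The sketch below counts "Block imported successfully" lines and lines mentioning an error. It assumes error lines contain the word "error" or "Error", which is a guess about the log format, not something the client guarantees. The helper name `summarize_sync_log` is hypothetical.

```bash
# Sketch: summarize sync-related log lines read from stdin.
# `summarize_sync_log` is a hypothetical helper; matching on "error"/"Error"
# is an assumption about how failures appear in the logs.
summarize_sync_log() {
  awk '
    /Block imported successfully/ { imported++ }
    /[Ee]rror/                    { errors++ }
    END { printf "imported=%d errors=%d\n", imported, errors }
  '
}

# Usage (docker logs may write to stderr, so merge both streams):
#   docker logs --tail 20 <node_name> 2>&1 | summarize_sync_log
```

A non-zero `errors` count is only a prompt to read the full log, not proof that sync failed.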
| ## Troubleshooting | ||
|
|
||
| ### "genesis time mismatch" or "validator count mismatch" | ||
| The checkpoint source is running a different genesis than the restarting node. Ensure both use the same genesis config directory. | ||
|
|
||
| ### "HTTP request failed" or connection refused | ||
| The checkpoint source node is down or unreachable. Verify with `curl` that the source endpoint returns a healthy response. | ||
|
|
||
| ### Node exits immediately after start | ||
| Check `docker logs <node_name>` for verification errors. Checkpoint sync exits on any failure without modifying the database, so it's safe to retry. | ||
|
|
||
| ### Node syncs but doesn't finalize | ||
| If the restarted node is the aggregator, attestations won't be aggregated and blocks will be produced with `attestation_count=0` until it catches back up to head. Finalization resumes once the aggregator is fully synced and participating in consensus again. | ||
|
|
||
| ### "Fallback pruning (finalization stalled)" after catch-up | ||
| Normal during catch-up. The node accumulated blocks faster than finalization can advance. This resolves once the node is fully caught up and participating in consensus. | ||
`isAggregator` field undocumented in schema reference

The `isAggregator` field shown in this YAML snippet doesn't appear anywhere else in the repository. It's absent from the validator-config schema table in `SKILL.md` (the "full schema" and "Field reference" sections). This creates an inconsistency: a reader following `SKILL.md` would not know this field exists, and might think the example here is illustrative/fictional rather than a real config field.

If `isAggregator` is a real field supported by `spin-node.sh`, it should be added to the schema documentation in `SKILL.md`. If it isn't a real field and the aggregator role is determined some other way (e.g., alphabetically or by validator index), the example should be updated to reflect the actual mechanism.