29 changes: 27 additions & 2 deletions .claude/skills/devnet-runner/SKILL.md
@@ -1,6 +1,6 @@
---
name: devnet-runner
description: Manage local development networks for lean consensus testing. Use when users want to (1) Configure a devnet with validator nodes, (2) Start/stop devnet nodes, (3) Regenerate genesis files, (4) Collect and dump node logs to files, (5) Troubleshoot devnet issues.
description: Manage local development networks for lean consensus testing. Use when users want to (1) Configure a devnet with validator nodes, (2) Start/stop devnet nodes, (3) Regenerate genesis files, (4) Collect and dump node logs to files, (5) Troubleshoot devnet issues, (6) Restart a node with checkpoint sync.
---

# Devnet Runner
@@ -390,6 +390,31 @@ Use the `run-devnet-with-timeout.sh` script for timed runs. Remember to include
|--------|-------------|
| `scripts/run-devnet-with-timeout.sh <seconds>` | Run devnet for specified duration, dump logs to repo root, then stop |

## Restarting a Node with Checkpoint Sync

To restart a single node mid-devnet (e.g., to test a new image or checkpoint sync itself):

**Important:** Restart nodes one at a time, waiting for each to fully sync before restarting the next. If 1/3 or more validators are offline simultaneously, finalization stalls because 3SF-mini requires 2/3+ votes to justify checkpoints.

1. Choose a node to restart. If restarting the aggregator, finalization and attestation inclusion in blocks will stop until it catches back up to head.
2. Identify a healthy node's metrics port to use as the checkpoint source.
3. Update the Docker image tag in `client-cmds/<client>-cmd.sh` if needed.
4. **Pull the new image before restarting** to minimize node downtime:
```bash
docker pull <image>:<tag>
```
5. Restart with checkpoint sync:
```bash
cd lean-quickstart && NETWORK_DIR=local-devnet ./spin-node.sh \
--restart-client <node_name> \
--checkpoint-sync-url http://127.0.0.1:<source_metrics_port>/lean/v0/states/finalized
```

**Important:** RPC and metrics share the same port (`--metrics-port`). There is no separate RPC port.

See `references/checkpoint-sync.md` for the full procedure, verification steps, and troubleshooting.

## Reference

See `references/clients.md` for client-specific details (images, ports, configurations).
- `references/clients.md`: Client-specific details (images, ports, configurations)
- `references/checkpoint-sync.md`: Restarting nodes with checkpoint sync
122 changes: 122 additions & 0 deletions .claude/skills/devnet-runner/references/checkpoint-sync.md
@@ -0,0 +1,122 @@
# Checkpoint Sync in Devnets

Restarting a node with checkpoint sync instead of replaying from genesis. Useful for testing checkpoint sync itself, upgrading a node's image mid-devnet, or recovering a crashed node.

## When to Use

- Testing checkpoint sync behavior (interop, verification, catch-up)
- Replacing a node's Docker image mid-run (e.g., testing a new build)
- Recovering a node that fell behind or crashed

## Prerequisites

- A running devnet with at least one healthy node to serve the checkpoint state
- The checkpoint source node's RPC must be reachable (same port as `--metrics-port`)

## Key Concepts

**RPC and metrics share the same port.** ethlambda serves both Prometheus metrics (`/metrics`) and the Lean API (`/lean/v0/...`) on the `--metrics-port`. There is no separate RPC port.

**Checkpoint sync URL format:**
```
http://<host>:<metrics-port>/lean/v0/states/finalized
```
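
A small helper can keep this URL consistent across commands. This is only a sketch: the function name `checkpoint_url` and the port used in the usage line are illustrative and not part of lean-quickstart.

```bash
# Hypothetical helper (not part of lean-quickstart): build the checkpoint
# sync URL from a host and that node's metrics port.
checkpoint_url() {
  local host="$1" port="$2"
  printf 'http://%s:%s/lean/v0/states/finalized\n' "$host" "$port"
}
```

For example, `checkpoint_url 127.0.0.1 8081` prints `http://127.0.0.1:8081/lean/v0/states/finalized`, where `8081` stands in for whatever `metricsPort` the source node uses.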

**The node must have the same genesis config.** Checkpoint sync verifies the downloaded state against the local genesis config (genesis time, validator pubkeys, validator count). The `--custom-network-config-dir` must point to the same genesis used by the rest of the devnet.

## Restart Procedure

**Restart nodes one at a time.** Wait for each node to fully sync and rejoin consensus before restarting the next. 3SF-mini requires 2/3+ of validators to vote in order to justify checkpoints and advance finalization. If 1/3 or more validators are offline simultaneously, finalization stalls until enough nodes come back online.
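
The offline limit stated above can be checked with integer arithmetic before taking nodes down. The sketch below encodes this document's rule (finalization stalls once 1/3 or more of validators are offline) and assumes equal-weight validators; the function name `safe_to_take_offline` is hypothetical.

```bash
# Hypothetical helper: succeed (exit 0) only when strictly fewer than 1/3 of
# validators would be offline. Integer form of offline/total < 1/3 is
# 3*offline < total, which avoids any rounding.
safe_to_take_offline() {
  local total="$1" offline="$2"
  [ $((3 * offline)) -lt "$total" ]
}
```

With 4 validators, `safe_to_take_offline 4 1` succeeds, while `safe_to_take_offline 4 2` fails. With only 3 validators, even a single restart reaches the 1/3 limit, so `safe_to_take_offline 3 1` fails.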

### Step 1: Choose the node to restart

Any node can be restarted, but be aware that restarting the aggregator node will stop finalization and attestation inclusion in blocks until it catches back up to head. Check which node is the aggregator in `validator-config.yaml`:
```yaml
# In lean-quickstart/<network-dir>/genesis/validator-config.yaml
validators:
- name: "ethlambda_0"
isAggregator: false
- name: "ethlambda_2"
isAggregator: true # restarting this stops finalization until it catches up
```
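
If `isAggregator` is present in the config as the snippet above suggests, the aggregator can be looked up from the shell instead of by eye. This is a rough sketch, not a YAML parser: it assumes each entry lists `name:` before `isAggregator:`, and the function name `find_aggregators` is hypothetical.

```bash
# Hypothetical helper: print the name of every validator entry marked
# isAggregator: true. Assumes `name:` precedes `isAggregator:` in each entry
# and that names are bare or double-quoted.
find_aggregators() {
  awk '
    /^[ \t]*-?[ \t]*name:/ {
      name = $0
      sub(/^[^:]*:[ \t]*/, "", name)   # drop everything up to and including "name:"
      gsub(/"/, "", name)              # strip double quotes
      sub(/[ \t]*(#.*)?$/, "", name)   # strip trailing whitespace or comment
    }
    /^[ \t]*isAggregator:[ \t]*true/ { print name }
  ' "$1"
}
```

Usage would look like `find_aggregators lean-quickstart/<network-dir>/genesis/validator-config.yaml`.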
Comment on lines +34 to +41

isAggregator field undocumented in schema reference

The isAggregator field shown in this YAML snippet doesn't appear anywhere else in the repository — it's absent from the validator-config schema table in SKILL.md (the "full schema" and "Field reference" sections). This creates an inconsistency: a reader following SKILL.md would not know this field exists, and might think the example here is illustrative/fictional rather than a real config field.

If isAggregator is a real field supported by spin-node.sh, it should be added to the schema documentation in SKILL.md. If it isn't a real field and the aggregator role is determined some other way (e.g., alphabetically or by validator index), the example should be updated to reflect the actual mechanism.


### Step 2: Identify a checkpoint source

Pick any other running node's metrics port as the checkpoint source. The port is configured as `metricsPort` in `validator-config.yaml`.

For local devnets (host networking), the URL is:
```
http://127.0.0.1:<metrics-port>/lean/v0/states/finalized
```

Verify the endpoint is reachable:
```bash
curl -s http://127.0.0.1:<metrics-port>/lean/v0/health
# Should return: {"status":"healthy","service":"lean-spec-api"}
```
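
When scripting the restart, the health check can be turned into a gate that waits for the source node. The sketch below matches the response format shown above; the function names `is_healthy` and `wait_for_source` and the retry count are illustrative choices, not part of lean-quickstart.

```bash
# Succeed only if the health response body reports status "healthy".
is_healthy() {
  printf '%s' "$1" | grep -Eq '"status"[ ]*:[ ]*"healthy"'
}

# Hypothetical polling wrapper: query the local health endpoint once per
# second, up to N times (default 10), and succeed as soon as it is healthy.
wait_for_source() {
  local port="$1" tries="${2:-10}" i=0
  while [ "$i" -lt "$tries" ]; do
    if is_healthy "$(curl -s --max-time 2 "http://127.0.0.1:${port}/lean/v0/health")"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}
```

A restart script could then run `wait_for_source <metrics-port> || exit 1` before proceeding to the next step.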

### Step 3: Update the Docker image tag (if changing versions)

Edit `lean-quickstart/client-cmds/<client>-cmd.sh` and change the image tag in `node_docker` before restarting:
```bash
# In lean-quickstart/client-cmds/ethlambda-cmd.sh, change:
node_docker="ghcr.io/lambdaclass/ethlambda:local \
# To:
node_docker="ghcr.io/lambdaclass/ethlambda:devnet3 \
```

### Step 4: Pull the new Docker image

**Pull the image before restarting** to minimize how long the node is absent from the network. If you skip this, `spin-node.sh` will pull during restart, adding minutes of downtime during which the node misses proposer slots and attestation duties:
```bash
docker pull <image>:<new_tag>
```

### Step 5: Restart with checkpoint sync

```bash
cd lean-quickstart && NETWORK_DIR=local-devnet ./spin-node.sh \
--restart-client <node_name> \
--checkpoint-sync-url http://127.0.0.1:<source_metrics_port>/lean/v0/states/finalized
```

This automatically:
1. Stops the existing container
2. Clears the data directory
3. Pulls the Docker image (skipped if already present locally)
4. Restarts with `--checkpoint-sync-url` passed to the node

If `--checkpoint-sync-url` is omitted, it defaults to `https://leanpoint.leanroadmap.org/lean/v0/states/finalized` (the public checkpoint provider).
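
For repeated restarts, the command can be wrapped so the source URL is always passed explicitly and the command can be previewed first. This is a sketch under assumptions: the wrapper name `restart_with_checkpoint` and the `DRY_RUN` variable are hypothetical, it uses only the `spin-node.sh` flags shown above, and it must be run from inside `lean-quickstart`.

```bash
# Hypothetical wrapper: restart one node with checkpoint sync from a local
# source node. With DRY_RUN=1 it only prints the command it would run.
restart_with_checkpoint() {
  local node="$1" source_port="$2"
  local url="http://127.0.0.1:${source_port}/lean/v0/states/finalized"
  # Reuse the positional parameters to hold the command and its arguments.
  set -- ./spin-node.sh --restart-client "$node" --checkpoint-sync-url "$url"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "NETWORK_DIR=local-devnet $*"
  else
    NETWORK_DIR=local-devnet "$@"
  fi
}
```

Running `DRY_RUN=1 restart_with_checkpoint <node_name> <source_metrics_port>` shows the exact command before anything is stopped.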

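Multiple nodes can be restarted at once with comma-separated names, but keep the number of nodes offline at the same time below 1/3 of validators so finalization does not stall: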
```bash
--restart-client ethlambda_0,ethlambda_3
```

### Step 6: Verify the node synced

```bash
docker logs --tail 20 <node_name>
```

Look for:
- "Block imported successfully" messages catching up to the current slot
- "Fork Choice Tree" showing finalized/justified/head slots close to the network's current state
- No error messages about verification failures or SSZ decode errors
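
The checks above can be approximated with a log filter. In this sketch, the "Block imported successfully" string comes from the list above, but the error patterns are guesses at the wording rather than confirmed log strings, and the function name `logs_look_synced` is hypothetical.

```bash
# Hypothetical helper: reads log text on stdin. Succeeds if at least one block
# import is reported and no line mentions a verification failure or an SSZ
# decode error. The error patterns are assumptions about the log wording.
logs_look_synced() {
  local logs
  logs="$(cat)"
  printf '%s\n' "$logs" | grep -q 'Block imported successfully' || return 1
  ! printf '%s\n' "$logs" | grep -Eiq 'verification fail|ssz decode'
}
```

A possible invocation is `docker logs --tail 200 <node_name> 2>&1 | logs_look_synced && echo "looks synced"`. It does not replace comparing the "Fork Choice Tree" slots against another node.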

## Troubleshooting

### "genesis time mismatch" or "validator count mismatch"
The checkpoint source is running a different genesis than the restarting node. Ensure both use the same genesis config directory.

### "HTTP request failed" or connection refused
The checkpoint source node is down or unreachable. Verify with `curl` that the source endpoint returns a healthy response.

### Node exits immediately after start
Check `docker logs <node_name>` for verification errors. Checkpoint sync exits on any failure without modifying the database, so it's safe to retry.

### Node syncs but doesn't finalize
If the restarted node is the aggregator, attestations won't be aggregated and blocks will be produced with `attestation_count=0` until it catches back up to head. Finalization resumes once the aggregator is fully synced and participating in consensus again.

### "Fallback pruning (finalization stalled)" after catch-up
Normal during catch-up. The node accumulated blocks faster than finalization can advance. This resolves once the node is fully caught up and participating in consensus.