Separate port configuration for API RPC and metrics #210

Merged
pablodeymo merged 8 commits into main from
separate-rpc-metrics-ports
Mar 13, 2026

@pablodeymo
Collaborator

@pablodeymo pablodeymo commented Mar 13, 2026

Closes #206

Motivation

The API RPC endpoints and Prometheus metrics were served from a single Axum server on one port (--metrics-port 5054). This made it impossible to expose metrics to a monitoring stack (e.g., Prometheus scraper) without also exposing the API, or to restrict API access without blocking metrics collection. In production and devnet deployments, these often need different network policies.

Description

Split the single HTTP server into two independent servers, each with its own port:

API Server (--api-port, default 5054)

  • GET /lean/v0/health — Health check (moved from metrics)
  • GET /lean/v0/states/finalized — Finalized state (SSZ)
  • GET /lean/v0/checkpoints/justified — Justified checkpoint (JSON)
  • GET /lean/v0/fork_choice — Fork choice tree (JSON)
  • GET /lean/v0/fork_choice/ui — Interactive fork choice UI (HTML)

Metrics Server (--metrics-port, default 5055)

  • GET /metrics — Prometheus metrics
  • GET /debug/pprof/allocs — Heap profiling
  • GET /debug/pprof/allocs/flamegraph — Heap flamegraph

CLI Flag Changes

Flag                Before       After
--metrics-address   127.0.0.1    Renamed to --http-address (shared by both servers)
--metrics-port      5054         Default changed to 5055
--api-port          N/A (new)    Default 5054
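
As a rough illustration of the flag layout above, the two listen addresses can be derived from one shared IP and two ports. This is a std-only sketch: the `HttpOptions` struct and its field names are assumptions made for illustration, not the actual ethlambda definitions, and the defaults simply mirror the table in this description.

```rust
use std::net::{IpAddr, Ipv4Addr, SocketAddr};

/// Hypothetical container for the three flags; the real option struct may differ.
pub struct HttpOptions {
    pub http_address: IpAddr, // --http-address
    pub api_port: u16,        // --api-port
    pub metrics_port: u16,    // --metrics-port
}

impl Default for HttpOptions {
    fn default() -> Self {
        Self {
            // Assumed default, carried over from the old --metrics-address.
            http_address: IpAddr::V4(Ipv4Addr::LOCALHOST),
            api_port: 5054,
            metrics_port: 5055,
        }
    }
}

impl HttpOptions {
    /// Returns (api_socket, metrics_socket), both on the shared address.
    pub fn sockets(&self) -> (SocketAddr, SocketAddr) {
        (
            SocketAddr::new(self.http_address, self.api_port),
            SocketAddr::new(self.http_address, self.metrics_port),
        )
    }
}
```

Keeping a single address field means the two servers can differ only by port, which matches the "Shared --http-address" design decision.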

Architecture

Before:                          After:

--metrics-address:metrics-port   --http-address:api-port
┌─────────────────────────┐      ┌─────────────────────────┐
│    Single Axum Server   │      │      API Server         │
│         :5054           │      │        :5054            │
├─────────────────────────┤      ├─────────────────────────┤
│ /metrics                │      │ /lean/v0/health         │
│ /lean/v0/health         │      │ /lean/v0/states/*       │
│ /lean/v0/states/*       │      │ /lean/v0/checkpoints/*  │
│ /lean/v0/checkpoints/*  │      │ /lean/v0/fork_choice*   │
│ /lean/v0/fork_choice*   │      └─────────────────────────┘
│ /debug/pprof/*          │
└─────────────────────────┘      --http-address:metrics-port
                                 ┌─────────────────────────┐
                                 │    Metrics Server       │
                                 │        :5055            │
                                 ├─────────────────────────┤
                                 │ /metrics                │
                                 │ /debug/pprof/*          │
                                 └─────────────────────────┘

Files Changed

  • bin/ethlambda/src/main.rs — Renamed --metrics-address to --http-address, added --api-port flag, changed --metrics-port default to 5055, spawn both servers via tokio::spawn with error logging
  • crates/net/rpc/src/lib.rs — Split start_rpc_server() into start_api_server(address, store) and start_metrics_server(address), moved /lean/v0/health route to the API router
  • crates/net/rpc/src/metrics.rs — Removed /lean/v0/health route (now served by API router)
  • preview-config.nix — Updated CLI flags (--http-address, --api-port), split port ranges: API on 8081-8084, metrics on 8085-8088
  • Dockerfile — Added EXPOSE 5055, updated port comments
  • docs/metrics.md — Updated default endpoint URL to port 5055
  • docs/fork_choice_visualization.md — Updated flag references to --api-port

Design Decisions

  • Health on API server: /lean/v0/health is an API endpoint (used by load balancers, checkpoint sync clients), not an ops/metrics endpoint, so it belongs with the API.
  • pprof on metrics server: Heap profiling is an ops concern, co-located with Prometheus metrics.
  • Shared --http-address: Both servers bind to the same address. Per-server addresses would add complexity without clear benefit today.
  • Different defaults: 5054 for API (preserves existing behavior for API consumers) and 5055 for metrics (avoids port conflict).
  • Error logging on spawn: Both servers log errors if they fail to bind or crash, instead of silently dropping the JoinHandle.

How to Test

# Run with defaults (API on :5054, metrics on :5055)
cargo run -- --custom-network-config-dir <config> --node-key <key> --node-id <id>

# Verify API server
curl http://127.0.0.1:5054/lean/v0/health
curl http://127.0.0.1:5054/lean/v0/checkpoints/justified

# Verify metrics server
curl http://127.0.0.1:5055/metrics

# Verify separation (these should fail)
curl http://127.0.0.1:5054/metrics        # 404
curl http://127.0.0.1:5055/lean/v0/health  # connection refused or 404

# Custom ports
cargo run -- --api-port 8080 --metrics-port 9090 ...

Breaking Changes

  • --metrics-address renamed to --http-address
  • --metrics-port default changed from 5054 to 5055
  • /lean/v0/health no longer available on the metrics port
  • Preview nodes: metrics now on ports 8085-8088 (were sharing 8081-8084 with API)

Any deployment scripts or docker-compose files referencing --metrics-address or scraping metrics on port 5054 will need updating.
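
One way to soften the rename for existing scripts, which this PR does not implement, would be to rewrite the old flag name before argument parsing. The sketch below is std-only and hypothetical: `normalize_legacy_flags` is an illustrative name, and a real implementation would more likely rely on the argument parser's own alias support.

```rust
/// Rewrites the legacy `--metrics-address` flag to `--http-address`.
/// Returns the rewritten arguments plus whether a legacy flag was seen,
/// so the caller can print a deprecation warning.
pub fn normalize_legacy_flags<I>(args: I) -> (Vec<String>, bool)
where
    I: IntoIterator<Item = String>,
{
    const OLD: &str = "--metrics-address";
    const NEW: &str = "--http-address";
    let mut saw_legacy = false;
    let rewritten: Vec<String> = args
        .into_iter()
        .map(|arg| {
            if arg == OLD {
                // Space-separated form: `--metrics-address 0.0.0.0`
                saw_legacy = true;
                NEW.to_string()
            } else if let Some(value) = arg.strip_prefix("--metrics-address=") {
                // Equals form: `--metrics-address=0.0.0.0`
                saw_legacy = true;
                format!("{}={}", NEW, value)
            } else {
                // Everything else, including --metrics-port, passes through.
                arg
            }
        })
        .collect();
    (rewritten, saw_legacy)
}
```

A caller would pass `std::env::args()` through this function, warn when the boolean is true, and hand the rewritten list to the parser.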

Split the single HTTP server into two independent servers:
- API server (--api-port, default 5054): health, states, checkpoints, fork choice
- Metrics server (--metrics-port, default 5055): prometheus metrics, pprof

Rename --metrics-address to --http-address (shared by both servers).
Move /lean/v0/health from metrics router to API router.
@github-actions

🤖 Kimi Code Review

Review Summary

This PR splits the monolithic RPC server into separate API and metrics servers. The changes are generally well-structured, but there are several issues to address:

1. Critical: Missing Error Handling (main.rs:170-171)

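The Result returned by each spawned server is never observed: tokio::spawn hands back a JoinHandle, and dropping that handle discards whatever the task returns. If either future completes with an error, the error disappears without a log line. Add proper error handling: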

tokio::spawn(async move {
    if let Err(e) = ethlambda_rpc::start_metrics_server(metrics_socket).await {
        error!(%e, "Metrics server failed");
    }
});

2. Security: Default Bind Address (main.rs:51)

Binding metrics to 127.0.0.1 by default is good for security, but consider adding validation to prevent accidental exposure. Add a warning if binding to 0.0.0.0:

if options.http_address == IpAddr::from([0, 0, 0, 0]) {
    warn!("HTTP API binding to all interfaces (0.0.0.0) - ensure this is intentional");
}
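
The comparison in the snippet above only matches the IPv4 wildcard. A slightly broader check, sketched here with the standard library's `IpAddr::is_unspecified`, also covers the IPv6 wildcard `::`. The function name is an assumption for illustration and does not appear in the PR.

```rust
use std::net::IpAddr;

/// Returns true when `addr` is a wildcard bind address (0.0.0.0 or ::),
/// meaning the server would listen on every interface.
pub fn binds_all_interfaces(addr: IpAddr) -> bool {
    addr.is_unspecified()
}
```

A startup routine could call this on the parsed --http-address value and emit the warning only when it returns true.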

3. Port Collision Risk (main.rs:53-55)

The new api_port default (5054) and metrics_port default (5055) are adjacent. Consider documenting this or making the defaults more distinct to prevent accidental collisions.

4. Code Duplication (rpc/src/lib.rs:15-28)

Both start_api_server and start_metrics_server have identical binding/serve patterns. Extract to a helper:

async fn start_server(address: SocketAddr, router: Router) -> Result<(), std::io::Error> {
    let listener = tokio::net::TcpListener::bind(address).await?;
    axum::serve(listener, router).await
}

5. API Route Organization (rpc/src/lib.rs:38)

The health endpoint /lean/v0/health is now in the API router but was previously in metrics. This is a breaking change that could break monitoring. Either:

  • Keep it in metrics for backward compatibility, or
  • Document this change clearly in release notes

6. Missing Graceful Shutdown (main.rs:170-171)

The spawned servers have no graceful shutdown mechanism. Consider adding a shutdown signal handler:

let api_handle = tokio::spawn(ethlambda_rpc::start_api_server(api_socket, store));
let metrics_handle = tokio::spawn(ethlambda_rpc::start_metrics_server(metrics_socket));

// Later in shutdown handling
api_handle.abort();
metrics_handle.abort();

7. Naming Consistency (main.rs:50)

The variable http_address is generic but only used for API/metrics. Consider api_metrics_address or similar for clarity.

These changes improve the architecture by separating concerns, but need the above fixes for production readiness.


Automated review by Kimi (Moonshot AI) · custom prompt

@github-actions

🤖 Codex Code Review

  1. High: server startup failures are now silently ignored
    In bin/ethlambda/src/main.rs:170 and bin/ethlambda/src/main.rs:171, both servers are spawned and their Result is never observed.

    • If bind fails (port in use, permission denied), the task exits with Err, but the node still logs Node initialized at line 173.
    • This is a correctness and operational risk: monitoring/API may be down while the node appears healthy.
    • Suggested fix: keep JoinHandles and either:
      • tokio::try_join! them with shutdown handling, or
      • wrap each task with logging on Err (at minimum), and fail fast on startup bind errors.
  2. Medium: changed CLI flag breaks backward compatibility without alias/deprecation path
    --metrics-address was replaced by --http-address in bin/ethlambda/src/main.rs:51-52. Existing deployments/scripts using --metrics-address will fail argument parsing.

    • Suggested fix: add a clap alias/deprecated arg path to preserve compatibility for at least one release.
  3. Medium: API/metrics port collision is not validated
    api_port and metrics_port are independently configurable (bin/ethlambda/src/main.rs:53-56, sockets built at lines 88-89). If equal, one listener fails; due to Point 1 this may be silent.

    • Suggested fix: validate api_port != metrics_port at startup or explicitly support shared-port mode with one router.

No consensus-critical logic (fork choice, attestation processing, finalization, state transition, XMSS, SSZ) appears modified in this PR. The routing split is reasonable, but startup/error-handling behavior needs tightening before merge.
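
The port validation suggested in point 3 could look roughly like the following. This is a minimal sketch, not code from the PR: `validate_ports` is a hypothetical helper, and it treats port 0 as "let the OS choose", so two zeros are not considered a collision.

```rust
/// Rejects configurations where both servers would try to bind the same port.
/// Hypothetical helper; the PR as described does not include this check.
pub fn validate_ports(api_port: u16, metrics_port: u16) -> Result<(), String> {
    // Port 0 asks the OS for any free port, so equal zeros cannot collide.
    if api_port != 0 && api_port == metrics_port {
        return Err(format!(
            "--api-port and --metrics-port must differ (both are {})",
            api_port
        ));
    }
    Ok(())
}
```

Running this before any listener is created turns a late bind failure into an immediate configuration error.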


Automated review by OpenAI Codex · custom prompt

@github-actions

🤖 Claude Code Review

Here is my review of PR 210:


PR 210: Separate port configuration for API RPC and metrics

The motivation is sound — decoupling the API and metrics surfaces for independent network policy control is a standard production pattern. The implementation is clean and minimal.


Bug: Server errors silently dropped

bin/ethlambda/src/main.rs, lines 170–171

tokio::spawn(ethlambda_rpc::start_metrics_server(metrics_socket));
tokio::spawn(ethlambda_rpc::start_api_server(api_socket, store));

The JoinHandles are immediately dropped, which means any Err returned by either server (e.g., EADDRINUSE on bind, or a listener crash) is silently discarded. The process will then sit at ctrl_c().await appearing healthy while serving neither API nor metrics.

Previously the server was awaited and .unwrap()'d, so a bind failure caused a visible panic/exit. This PR regresses that operational visibility.

The minimum fix is to log the error inside the spawned task:

tokio::spawn(async move {
    ethlambda_rpc::start_metrics_server(metrics_socket)
        .await
        .inspect_err(|err| error!(%err, "Metrics server failed"))
        .ok();
});

Same for the API server.
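
The same idea, reporting a spawned task's error rather than dropping it, can be shown without tokio. The sketch below is an analogue built on std threads and a channel so that it compiles with no external crates; it is not the code from this PR, and `spawn_reporting`, `failure_channel` and `TaskFailure` are hypothetical names.

```rust
use std::sync::mpsc::{self, Receiver, Sender};
use std::thread::{self, JoinHandle};

/// A failure reported by a named background task.
pub type TaskFailure = (&'static str, String);

/// Creates the channel on which task failures arrive.
pub fn failure_channel() -> (Sender<TaskFailure>, Receiver<TaskFailure>) {
    mpsc::channel()
}

/// Runs `task` on a new thread and forwards any `Err` to `tx`
/// instead of letting it vanish with a dropped handle.
pub fn spawn_reporting<F>(
    name: &'static str,
    tx: Sender<TaskFailure>,
    task: F,
) -> JoinHandle<()>
where
    F: FnOnce() -> Result<(), String> + Send + 'static,
{
    thread::spawn(move || {
        if let Err(err) = task() {
            // Ignore send errors: the receiver may already be gone at shutdown.
            let _ = tx.send((name, err));
        }
    })
}
```

The main thread can then read from the receiver and decide whether to log the failure or exit, which is the choice the reviews above leave to the author.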


Minor: Startup log races bind

info!("Node initialized") is emitted immediately after tokio::spawn(...), before either server has actually bound its port (the TcpListener::bind happens inside the spawned futures). An operator reading the log may think the node is fully ready when the servers have not yet bound. This is low severity, but worth noting.
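
One way to close this gap is to bind the listeners before logging readiness, so that a bind failure surfaces synchronously. A minimal std-only sketch under that assumption follows; `bind_all` is a hypothetical helper, and the PR itself keeps the bind inside the spawned futures.

```rust
use std::io;
use std::net::{SocketAddr, TcpListener};

/// Binds every address up front and returns the listeners, or the first
/// bind error encountered (for example "address already in use").
pub fn bind_all(addrs: &[SocketAddr]) -> io::Result<Vec<TcpListener>> {
    addrs.iter().map(|addr| TcpListener::bind(addr)).collect()
}
```

With tokio, listeners bound this way could be switched to non-blocking mode and converted with `tokio::net::TcpListener::from_std` before being handed to the server tasks, so the readiness log would only appear once both ports are held.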


Correctness and design: everything else looks good

  • The split of start_rpc_server into start_api_server / start_metrics_server is clean — no logic is duplicated, no routes are lost.
  • Moving /lean/v0/health to the API router is the right call: health checks are API concerns, not metrics concerns.
  • get_health visibility (pub(crate)) and its reference to crate::JSON_CONTENT_TYPE are correct after the reorganisation.
  • The shared --http-address binding for both servers is a reasonable simplification; operators needing per-server addresses can be addressed in a follow-up.
  • The breaking CLI rename (--metrics-address → --http-address) and port default change (5054 → 5055 for metrics) are clearly documented in the PR description.

Summary

One real bug (silently swallowed server errors) that needs fixing before merge; everything else is in good shape.


Automated review by Claude (Anthropic) · custom prompt

@greptile-apps
Contributor

greptile-apps bot commented Mar 13, 2026

Greptile Summary

This PR splits the single Axum server into two independent servers — an API server (default :5054) serving RPC and health endpoints, and a metrics server (default :5055) serving Prometheus and pprof — providing the network-policy separation described in the motivation. The crates/net/rpc changes are clean and correct. Three issues need attention before merging:

  • Silent server failures (bin/ethlambda/src/main.rs): Both tokio::spawn calls drop their JoinHandle immediately. If either server fails to bind (e.g., port conflict) or crashes at runtime, the error is silently swallowed and the node continues running without the affected server. The original .await.unwrap() pattern would at least panic on a bind failure. The handles should be retained and errors logged or used to abort the process.
  • Broken preview deployment (preview-config.nix): The Nix systemd service still passes --metrics-address which clap will reject as an unknown flag, causing all four preview nodes to fail to start. It needs to be updated to --http-address.
  • Unexposed Docker port (Dockerfile): The EXPOSE directive and its comments reflect the old single-port layout. Port 5055 (new metrics default) is not listed, so container-based Prometheus scrapers will be unable to reach metrics unless the port is manually mapped.

Documentation in docs/metrics.md and docs/fork_choice_visualization.md also still references the old port 5054 for metrics and the --metrics-address flag, and should be updated alongside this change.

Confidence Score: 2/5

  • Not safe to merge — the preview Nix config uses a removed CLI flag that will crash all preview nodes on restart, and the metrics server port is missing from the Dockerfile EXPOSE directive.
  • The core RPC split logic is correct, but there are three concrete regressions: a deployment-breaking stale CLI flag in preview-config.nix, a missing EXPOSE port in the Dockerfile, and silent swallowing of server startup/runtime errors in main.rs due to dropped JoinHandles.
  • bin/ethlambda/src/main.rs (dropped JoinHandles), preview-config.nix (stale --metrics-address flag), Dockerfile (missing port 5055 EXPOSE)

Important Files Changed

Filename                        Overview
bin/ethlambda/src/main.rs       Adds --api-port / --http-address CLI flags and spawns two servers via tokio::spawn, but drops both JoinHandles; bind failures and runtime errors from either server are silently swallowed.
crates/net/rpc/src/lib.rs       Cleanly splits start_rpc_server into start_api_server and start_metrics_server, moves /lean/v0/health to the API router, and keeps debug profiling routes on the metrics server; logic is correct.
crates/net/rpc/src/metrics.rs   Removes /lean/v0/health from the Prometheus router; get_health handler is retained as pub(crate) for reuse by the API router. Straightforward and correct.

Sequence Diagram

sequenceDiagram
    participant Main as main.rs
    participant API as start_api_server(:5054)
    participant Metrics as start_metrics_server(:5055)
    participant Client as API Client
    participant Prom as Prometheus Scraper

    Main->>API: tokio::spawn (JoinHandle dropped)
    Main->>Metrics: tokio::spawn (JoinHandle dropped)
    Main->>Main: wait for ctrl_c

    Client->>API: GET /lean/v0/health
    API-->>Client: 200 OK {"status":"healthy"}

    Client->>API: GET /lean/v0/states/finalized
    API-->>Client: 200 SSZ bytes

    Client->>API: GET /lean/v0/checkpoints/justified
    API-->>Client: 200 JSON checkpoint

    Client->>API: GET /lean/v0/fork_choice[/ui]
    API-->>Client: 200 JSON / HTML

    Prom->>Metrics: GET /metrics
    Metrics-->>Prom: 200 Prometheus text

    Prom->>Metrics: GET /debug/pprof/allocs[/flamegraph]
    Metrics-->>Prom: 200 heap profile

Comments Outside Diff (2)

  1. preview-config.nix, line 95 (link)

    Stale --metrics-address flag breaks preview nodes

    The --metrics-address CLI flag was renamed to --http-address in this PR, but preview-config.nix still passes the old name. clap will reject the unknown flag and refuse to start, causing all four preview systemd services to fail immediately.

  2. Dockerfile, line 67 (link)

    New metrics port 5055 not exposed in Dockerfile

    The EXPOSE directive still only lists port 5054. After this change the metrics server defaults to port 5055, so Prometheus scrapers in container-based deployments won't be able to reach it unless the port is explicitly mapped. The comment even says "5054 - Prometheus metrics" which is now incorrect (5054 is the API port, 5055 is metrics).

    Also update the adjacent comment to reflect the new assignment:

    # 5054 - HTTP API
    # 5055 - Prometheus metrics
    
Prompt To Fix All With AI
This is a comment left during a code review.
Path: bin/ethlambda/src/main.rs
Line: 170-171

Comment:
**Server failures are silently dropped**

Both `tokio::spawn` calls immediately drop their `JoinHandle`, which means any errors — including a bind failure (e.g., port already in use) — are silently swallowed. If either server fails to start, the node will continue running without logging an error, with one or both HTTP surfaces unavailable. The previous code used `.await.unwrap()`, which would at least panic on bind failure.

Consider retaining the handles and either aborting the process or surfacing errors:

```rust
let metrics_handle = tokio::spawn(async move {
    if let Err(err) = ethlambda_rpc::start_metrics_server(metrics_socket).await {
        error!(%err, "Metrics server failed");
    }
});
let api_handle = tokio::spawn(async move {
    if let Err(err) = ethlambda_rpc::start_api_server(api_socket, store).await {
        error!(%err, "API server failed");
    }
});

info!("Node initialized");

tokio::signal::ctrl_c().await.ok();
println!("Shutting down...");
metrics_handle.abort();
api_handle.abort();
```

At minimum, the `Result` from each future should be logged or cause the process to exit, so operators learn of a failed bind instead of wondering why metrics/API are unreachable.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: preview-config.nix
Line: 95

Comment:
**Stale `--metrics-address` flag breaks preview nodes**

The `--metrics-address` CLI flag was renamed to `--http-address` in this PR, but `preview-config.nix` still passes the old name. `clap` will reject the unknown flag and refuse to start, causing all four preview systemd services to fail immediately.

```suggestion
          "--http-address" "0.0.0.0"
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: Dockerfile
Line: 67

Comment:
**New metrics port `5055` not exposed in Dockerfile**

The `EXPOSE` directive still only lists port `5054`. After this change the metrics server defaults to port `5055`, so Prometheus scrapers in container-based deployments won't be able to reach it unless the port is explicitly mapped. The comment even says "5054 - Prometheus metrics" which is now incorrect (5054 is the API port, 5055 is metrics).

```suggestion
EXPOSE 9000/tcp 9000/udp 9001/udp 5052 5054 5055
```

Also update the adjacent comment to reflect the new assignment:
```
# 5054 - HTTP API
# 5055 - Prometheus metrics
```

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: 0da3ed0

Comment on lines +170 to +171
tokio::spawn(ethlambda_rpc::start_metrics_server(metrics_socket));
tokio::spawn(ethlambda_rpc::start_api_server(api_socket, store));
Contributor

Server failures are silently dropped

Both tokio::spawn calls immediately drop their JoinHandle, which means any errors — including a bind failure (e.g., port already in use) — are silently swallowed. If either server fails to start, the node will continue running without logging an error, with one or both HTTP surfaces unavailable. The previous code used .await.unwrap(), which would at least panic on bind failure.

Consider retaining the handles and either aborting the process or surfacing errors:

let metrics_handle = tokio::spawn(async move {
    if let Err(err) = ethlambda_rpc::start_metrics_server(metrics_socket).await {
        error!(%err, "Metrics server failed");
    }
});
let api_handle = tokio::spawn(async move {
    if let Err(err) = ethlambda_rpc::start_api_server(api_socket, store).await {
        error!(%err, "API server failed");
    }
});

info!("Node initialized");

tokio::signal::ctrl_c().await.ok();
println!("Shutting down...");
metrics_handle.abort();
api_handle.abort();

At minimum, the Result from each future should be logged or cause the process to exit, so operators learn of a failed bind instead of wondering why metrics/API are unreachable.

Collaborator Author

Fixed — both spawns now wrap the future and log errors with error!().

- Log errors from spawned API/metrics servers instead of silently dropping JoinHandles
- Update preview-config.nix: rename --metrics-address to --http-address, add --api-port,
  assign metrics ports to 8085-8088 range
- Expose port 5055 in Dockerfile and update port comments
- Update docs to reflect new flag names and default ports
@pablodeymo
Collaborator Author

All three issues addressed in 0570730:

  1. Server failures: Both spawns now log errors via error!() instead of dropping the JoinHandle silently.
  2. preview-config.nix: Updated --metrics-address → --http-address, added --api-port, split ports (API 8081-8084, metrics 8085-8088).
  3. Dockerfile: Added EXPOSE 5055 and updated port comments.

Also updated docs/metrics.md and docs/fork_choice_visualization.md with the new flag names and defaults.

Match Lighthouse's port convention for drop-in compatibility with
existing monitoring and deployment setups:
- --api-port default: 5054 -> 5052
- --metrics-port default: 5055 -> 5054
@MegaRedHand
Collaborator

MegaRedHand commented Mar 13, 2026

Update Claude's context with this information. Some parts mention we only support setting the metrics port.

pablodeymo added a commit that referenced this pull request Mar 13, 2026
Update crate tree description and add new "HTTP Servers (API + Metrics)"
section covering CLI flags (--http-address, --api-port, --metrics-port),
API server routes (:5052), metrics server routes (:5054), and startup
behavior. Reflects the server split from PR #210.
@pablodeymo
Collaborator Author

Done in f0ac667 — updated CLAUDE.md with the new dual-server architecture: CLI flags (--http-address, --api-port, --metrics-port), API server routes (:5052), metrics server routes (:5054), and startup behavior.

Reflect PR #210 changes: ethlambda now runs separate API (--api-port,
default 5052) and metrics (--metrics-port, default 5054) HTTP servers
with a shared bind address (--http-address). Updated validator config
schema, port allocation guide, troubleshooting, client reference, and
known issues sections.
… docs

Update SKILL.md and checkpoint-sync.md to reflect that ethlambda now
serves API endpoints on --api-port (default 5052) and metrics on
--metrics-port (default 5054). Checkpoint sync URLs must use the API
port, not the metrics port.
@pablodeymo pablodeymo merged commit 3e54039 into main Mar 13, 2026
2 checks passed
@pablodeymo pablodeymo deleted the separate-rpc-metrics-ports branch March 13, 2026 18:55