
Claude/debug network performance etj9 w #20

Closed

luhenry wants to merge 10 commits into main from claude/debug-network-performance-Etj9W

Conversation

@luhenry luhenry commented May 10, 2026

No description provided.

luhenry added 9 commits May 10, 2026 20:30
These cover the two metrics in the debugging plan that no built-in
node_exporter collector exposes:

  raw_github_probe.py (5 min timer)
    Downloads a fixed artefact from raw.githubusercontent.com (Fastly,
    customer's actual path) and from speed.cloudflare.com (non-Fastly
    control). curl is shelled out so we get %{remote_ip} reflecting the
    IP libcurl actually connected to. Emits raw_github_probe_seconds,
    _bytes_per_second and _curl_exit_code labelled by target and
    remote_ip. When the Fastly target sags but the Cloudflare control
    stays flat, the issue is on the Scaleway-Fastly path (H9), not
    Scaleway WAN egress generally (H5).

  dns_probe.py (60 s timer)
    Resolves raw.githubusercontent.com via socket.getaddrinfo (same
    resolver libc/curl uses) and emits an info-style series
    runner_dns_resolved_ip{ip="..."}=1 plus a count. Lets us correlate
    slow windows with cache-region or POP flips (H1).

Both scripts run as the node_exporter user, write atomically (mkstemp +
os.replace) to the textfile collector dir, and self-recover from probe
failures by emitting curl_exit_code=99 instead of skipping the row.
Stdlib only — no extra apt packages needed.
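
A condensed sketch of the mechanics described above: the curl
shell-out for %{remote_ip}, the getaddrinfo resolution, and the
mkstemp + os.replace publish. The collector directory, timeouts, and
URL wiring are illustrative, not the actual probe sources.

  import os, socket, subprocess, tempfile, time

  TEXTFILE_DIR = "/var/lib/node_exporter/textfile"  # illustrative path

  def curl_probe(url):
      # Shell out to curl so %{remote_ip} reports the IP libcurl
      # actually connected to; 99 marks an internal probe failure.
      start = time.monotonic()
      try:
          out = subprocess.run(
              ["curl", "-sS", "-o", "/dev/null", "-w", "%{remote_ip}",
               "--max-time", "60", url],
              capture_output=True, text=True, timeout=90)
          code, remote_ip = out.returncode, out.stdout.strip()
      except Exception:
          code, remote_ip = 99, ""
      return time.monotonic() - start, code, remote_ip

  def resolve(host="raw.githubusercontent.com"):
      # Same resolver path libc/curl uses.
      return sorted({ai[4][0] for ai in socket.getaddrinfo(host, 443)})

  def write_textfile(lines, name):
      # mkstemp in the target dir + os.replace is an atomic publish;
      # node_exporter never sees a half-written file.
      fd, tmp = tempfile.mkstemp(dir=TEXTFILE_DIR)
      with os.fdopen(fd, "w") as f:
          f.write("\n".join(lines) + "\n")
      os.replace(tmp, os.path.join(TEXTFILE_DIR, name + ".prom"))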

https://claude.ai/code/session_01BgToCb8eDGsrkyTddCtt9Z
Ten panels organized by hypothesis (H10, H9, sentinels), in the
Grafana v2alpha dashboard schema (kind/spec wrappers, AutoGridLayout,
elements map keyed by panel-N) used by Scaleway's managed Grafana.

  H10 (single-queue NIC / CPU0 softirq saturation)
    - CPU softirq% by CPU                 (default cpu collector)
    - NET_RX softirq deliveries/sec       (--collector.softirqs)
    - end0 hard IRQs/sec by CPU           (--collector.interrupts)

  H9 (TCP-stack health, Scaleway-Fastly path proxy)
    - TCPTimeouts/sec
    - Retransmit ratio with 0.5% threshold
    - Lost retransmits & spurious RTOs

  Sentinels
    - end0 throughput vs 100 Mbps line
    - end0 NIC drops & errors             (H2 should stay flat)
    - Conntrack utilisation               (H3 should stay << 0.01)
    - EEE / LPI enter rate                (H7 fallback; --collector.ethtool)

Datasource UID hardcoded to fflnugavx2h34c (Scaleway Cockpit metrics
data source for this project). Layout is AutoGridLayout with three
columns; panels flow into four rows. No template variables yet —
adding a node selector needs a sample of how variables are expressed
in the v2 schema first.
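
Roughly, the skeleton being described, written here as a Python dict
standing in for the JSON; only the pieces named above (the kind/spec
wrappers, the elements map keyed panel-N, AutoGridLayout) come from
the commit, the rest is elided or assumed:

  dashboard = {
      "kind": "Dashboard",
      "spec": {
          "elements": {
              # one Panel element per chart, keyed panel-N
              "panel-1": {"kind": "Panel", "spec": {}},  # body elided
          },
          "layout": {
              "kind": "AutoGridLayout",
              "spec": {"maxColumnCount": 3},  # three columns, as above
          },
      },
  }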

https://claude.ai/code/session_01BgToCb8eDGsrkyTddCtt9Z
- panel-11 (Sentinel: end0 TX drops & errors) mirrors panel-8 against
  node_network_transmit_{drop,errs}_total. Layout puts panels 7, 8, 11
  in the same row so throughput, RX errs, and TX errs sit visually
  grouped under one NIC sentinel band.
- panel-8 renamed to "RX drops & errors" to pair cleanly with the new
  TX panel.
- New `node` query variable (multi, includeAll) using
  label_values(node) so the dashboard can be filtered per-node. Every
  panel query now selects on node=~"$node" so the variable actually
  scopes results.
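
Illustratively, a scoped panel query then looks like
rate(node_network_transmit_drop_total{node=~"$node", device="end0"}[5m]);
the device matcher and rate window are assumptions here, while
node=~"$node" is the selector this commit threads through every query.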

https://claude.ai/code/session_01BgToCb8eDGsrkyTddCtt9Z
Two reasons panels in the dashboard were empty:

1. Dashboard query typo. The softirq panel queried node_softirq_total
   (singular) but node_exporter exposes node_softirqs_total (plural).

2. node_exporter's --collector.netstat.fields default regex excludes
   TCPLostRetransmit and TCPSpuriousRTOs (and a number of other
   TcpExt_* fields). The collector reads them from /proc/net/netstat
   but drops them before exposing them. Setting the filter to ^.*$
   exposes the full set; the cardinality bump on a runner is a few
   dozen series per node, negligible.
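
   In flag terms that is --collector.netstat.fields='^.*$' on the
   node_exporter command line (shell quoting illustrative).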

The remaining two empty panels (NET_RX softirq cross-check, EEE/LPI)
are expected to start populating once a runner is reprovisioned with
this scw.py — verifiable on the node via
  curl -s 127.0.0.1:9100/metrics | grep -E '^node_(softirqs|interrupts|ethtool)_'

https://claude.ai/code/session_01BgToCb8eDGsrkyTddCtt9Z
The light_dwmac_eth driver names the counter
irq_tx_path_in_lpi_mode_n, but st_gmac (the driver on the runner that
exported the metrics) names it irq_transmitted_path_in_lpi_mode_n.
The H7 sentinel panel was querying the light_dwmac form and getting
no data. Use the st_gmac form, which is what the fleet actually
exposes.
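
With the ethtool collector's node_ethtool_ prefix, the panel ends up
querying something like node_ethtool_irq_transmitted_path_in_lpi_mode_n;
the prefix is standard for that collector, but the exact sanitised
name is an assumption here.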

https://claude.ai/code/session_01BgToCb8eDGsrkyTddCtt9Z
Two unrelated query fixes against current node_exporter (1.11.1):

1. node_exporter renamed the softirqs metric to
   node_softirqs_functions_total. The H10 NET_RX panel was querying
   the old node_softirqs_total. (Raw data already confirms H10:
   CPU0 has 1.55M NET_RX softirqs vs ~5K on each other CPU.)

2. The interrupts collector exposes the device name under the
   `devices` label, not `type` (which is the IRQ number). Filter
   end0 IRQs via devices=~".*end0.*"; legend now shows cpu, dev,
   and irq for clarity.
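
After both fixes the panels query something like
rate(node_softirqs_functions_total{type="NET_RX"}[5m]) faceted by cpu
and rate(node_interrupts_total{devices=~".*end0.*"}[5m]); the 5m
windows are illustrative.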

https://claude.ai/code/session_01BgToCb8eDGsrkyTddCtt9Z
Six rows:
  H1  Bad Fastly POP / IPv6 path                  3 new panels (12-14)
  H5  Scaleway WAN egress contention              2 new panels (15-16)
  H7  EEE / LPI micro-stalls (fallback)           panel 10
  H9  TCP-stack health                            panels 4-6
  H10 Single-core CPU0 softirq saturation         panels 1-3
  Sentinels (H2/H3, refuted)                      panels 7, 8, 11, 9

H2 (NIC errors), H3 (conntrack), H4 (PMTU), H6 (ASN throttle), H8
(in-host contention) don't get their own rows: H2/H3 are sentinels,
H4/H6/H8 have no on-host Prometheus metric (live capture or off-host
probe only).

H1/H5 panels read raw_github_probe_seconds and _bytes_per_second
(faceted by remote_ip/target) plus runner_dns_resolved_ip; they
populate once raw_github_probe.py and dns_probe.py are running on
the node, and until then sit as empty placeholders waiting for the
probe data.

Layout uses RowsLayout containing one RowsLayoutRow per hypothesis,
each wrapping an AutoGridLayout with maxColumnCount=3 — schema
matches the v2 sample exported by Scaleway-managed Grafana.
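
In the same hedged sketch form, the outer layout then becomes roughly
(the row/field nesting beyond the names the commit gives is an
assumption):

  layout = {
      "kind": "RowsLayout",
      "spec": {
          "rows": [
              {"kind": "RowsLayoutRow",
               "spec": {"title": "H10 Single-core CPU0 softirq saturation",
                        "layout": {"kind": "AutoGridLayout",
                                   "spec": {"maxColumnCount": 3}}}},
              # ...one RowsLayoutRow per hypothesis row
          ],
      },
  }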

https://claude.ai/code/session_01BgToCb8eDGsrkyTddCtt9Z
Mirrors the existing NET_RX softirq panel against
node_softirqs_functions_total{type="NET_TX"}. Slots into the H10 row
between NET_RX softirq and end0 hard IRQs so RX and TX softirq
pressure sit side by side, with the hard-IRQ panel on the next
visual row.

https://claude.ai/code/session_01BgToCb8eDGsrkyTddCtt9Z
The probes were embedded as bash heredocs inside SETUP_SCRIPT; move the
canonical sources to scripts/probes/raw_github_probe.py and
scripts/probes/dns_probe.py and have run_setup read and substitute
them in. Behaviour identical; SETUP_SCRIPT shrinks by ~140 lines.
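
A minimal sketch of the read-and-substitute step; run_setup and the
paths are from the commit, while the placeholder token is an
assumption:

  from pathlib import Path

  def inline_probes(setup_script: str) -> str:
      # Substitute the canonical probe sources into SETUP_SCRIPT so
      # the heredocs no longer live inline.
      for name in ("raw_github_probe.py", "dns_probe.py"):
          src = (Path("scripts/probes") / name).read_text()
          setup_script = setup_script.replace(
              f"@@{name}@@",  # assumed placeholder token
              src)
      return setup_script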

https://claude.ai/code/session_01BgToCb8eDGsrkyTddCtt9Z
@luhenry luhenry force-pushed the claude/debug-network-performance-Etj9W branch from d5c3e74 to c8d10ef on May 10, 2026 20:31
Three new panels:

  H9: Spurious-RTO ratio (panel 18)
    TCPSpuriousRTOs / TCPTimeouts (illustrative query after this
    list). ~1.0 means most RTOs are spurious (network was fine,
    kernel was impatient — rto_min tuning); ~0 means genuine
    packet loss.

  H9: Fast recovery vs RTO (panel 19)
    TCPSackRecovery + TCPRenoRecovery (cheap, cwnd preserved) vs
    TCPTimeouts (expensive, cwnd collapsed). When RTO dominates, the
    loss pattern is severe/bursty.

  H10: softnet drops & NAPI squeezes per CPU (panel 20)
    rate(node_softnet_dropped_total) and _times_squeezed_total. If
    times_squeezed > 0 on the CPU that handles end0 IRQs, the NAPI
    poll exhausted its budget and the NIC dropped at the ring buffer
    before TCP saw it, which connects H10 (CPU0 softirq saturation)
    to H9 (downstream TCP timeouts).
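
Illustratively, the panel-18 ratio divides
rate(node_netstat_TcpExt_TCPSpuriousRTOs[5m]) by
rate(node_netstat_TcpExt_TCPTimeouts[5m]) once the widened netstat
fields filter from the earlier commit exposes both counters; the
TcpExt naming follows node_exporter's convention, the windows are
assumptions.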

https://claude.ai/code/session_01BgToCb8eDGsrkyTddCtt9Z
@luhenry luhenry closed this May 10, 2026
@luhenry luhenry deleted the claude/debug-network-performance-Etj9W branch May 11, 2026 14:15