Claude/debug network performance etj9 w#20
Closed
luhenry wants to merge 10 commits into
These cover the two metrics in the debugging plan that no built-in
node_exporter collector exposes:
raw_github_probe.py (5 min timer)
Downloads a fixed artefact from raw.githubusercontent.com (Fastly,
customer's actual path) and from speed.cloudflare.com (non-Fastly
control). curl is shelled out so we get %{remote_ip} reflecting the
IP libcurl actually connected to. Emits raw_github_probe_seconds,
_bytes_per_second and _curl_exit_code labelled by target and
remote_ip. When the Fastly target sags but the Cloudflare control
stays flat, the issue is on the Scaleway-Fastly path (H9), not
Scaleway WAN egress generally (H5).
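A minimal sketch of the curl shell-out described above, not the script from this PR. The write-out format string, the helper names (run_curl, render_probe) and the full metric names (expanded from the `_bytes_per_second` / `_curl_exit_code` shorthand) are assumptions. Using 99 for failures that happen before curl reports an exit code of its own is also an assumption about when that sentinel applies.

```python
import subprocess
import sys

# Assumed write-out format: connected IP, total seconds, bytes per second.
WRITE_OUT = "%{remote_ip} %{time_total} %{speed_download}"
PROBE_FAILED = 99  # sentinel exit code, used here when curl could not be run at all


def run_curl(url: str, timeout_s: int = 60) -> tuple:
    """Download url, discard the body, return (exit_code, write_out_text)."""
    try:
        proc = subprocess.run(
            ["curl", "-sS", "-o", "/dev/null", "--max-time", str(timeout_s),
             "-w", WRITE_OUT, url],
            capture_output=True, text=True, check=False,
        )
    except OSError:
        # curl binary missing or not executable: still emit a row.
        return PROBE_FAILED, ""
    return proc.returncode, proc.stdout


def render_probe(target: str, exit_code: int, write_out: str) -> str:
    """Turn one curl result into textfile-collector lines."""
    fields = write_out.split()
    if exit_code == 0 and len(fields) == 3:
        remote_ip, seconds, speed = fields[0], float(fields[1]), float(fields[2])
    else:
        # Failed probe: keep the series alive with zeros and the exit code.
        remote_ip, seconds, speed = "", 0.0, 0.0
    labels = f'target="{target}",remote_ip="{remote_ip}"'
    return (
        f"raw_github_probe_seconds{{{labels}}} {seconds}\n"
        f"raw_github_probe_bytes_per_second{{{labels}}} {speed}\n"
        f"raw_github_probe_curl_exit_code{{{labels}}} {exit_code}\n"
    )


if __name__ == "__main__":
    # Usage: probe_sketch.py <target-label> <url>
    code, text = run_curl(sys.argv[2])
    sys.stdout.write(render_probe(sys.argv[1], code, text))
```

Keeping rendering separate from the subprocess call makes the failure path (empty remote_ip, non-zero exit code) easy to exercise without network access.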
dns_probe.py (60 s timer)
Resolves raw.githubusercontent.com via socket.getaddrinfo (same
resolver libc/curl uses) and emits an info-style series
runner_dns_resolved_ip{ip="..."}=1 plus a count. Lets us correlate
slow windows with cache-region or POP flips (H1).
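A rough sketch of the resolution step, assuming the count is exported as runner_dns_resolved_ip_count (the message above only says "plus a count", so that name is hypothetical, as are the helper names).

```python
import socket

HOST = "raw.githubusercontent.com"


def resolve(host: str) -> list:
    """Return the sorted, de-duplicated addresses libc resolves for host."""
    infos = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def render_dns(ips: list) -> str:
    """Render one info-style series per resolved IP, plus a count."""
    lines = [f'runner_dns_resolved_ip{{ip="{ip}"}} 1' for ip in ips]
    # Hypothetical metric name for the count.
    lines.append(f"runner_dns_resolved_ip_count {len(ips)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_dns(resolve(HOST)), end="")
```

Because each IP is its own series with value 1, a POP flip shows up in Grafana as one series ending and another starting.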
Both scripts run as the node_exporter user, write atomically (mkstemp +
os.replace) to the textfile collector dir, and self-recover from probe
failures by emitting curl_exit_code=99 instead of skipping the row.
Stdlib only — no extra apt packages needed.
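The atomic-write pattern (mkstemp + os.replace) can be sketched as follows. The function name, the temp-file naming and the chmod are illustrative assumptions, not taken from the scripts in this PR.

```python
import os
import tempfile


def write_atomically(directory: str, name: str, body: str) -> str:
    """Write body to directory/name so readers never see a partial file."""
    # The temp file lives in the same directory, so os.replace stays on one
    # filesystem and is atomic. The .tmp suffix keeps it out of the *.prom glob.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=f".{name}.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as fh:
            fh.write(body)
        # mkstemp creates the file 0600; relax it in case the reader is another user.
        os.chmod(tmp_path, 0o644)
        final_path = os.path.join(directory, name)
        os.replace(tmp_path, final_path)
        return final_path
    except BaseException:
        # Do not leave a stray temp file behind if anything failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A reader such as the textfile collector then sees either the previous file or the complete new one, never a half-written scrape.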
https://claude.ai/code/session_01BgToCb8eDGsrkyTddCtt9Z
Ten panels organized by hypothesis (H10, H9, sentinels), in the
Grafana v2alpha dashboard schema (kind/spec wrappers, AutoGridLayout,
elements map keyed by panel-N) used by Scaleway's managed Grafana.
H10 (single-queue NIC / CPU0 softirq saturation)
- CPU softirq% by CPU (default cpu collector)
- NET_RX softirq deliveries/sec (--collector.softirqs)
- end0 hard IRQs/sec by CPU (--collector.interrupts)
H9 (TCP-stack health, Scaleway-Fastly path proxy)
- TCPTimeouts/sec
- Retransmit ratio with 0.5% threshold
- Lost retransmits & spurious RTOs
Sentinels
- end0 throughput vs 100 Mbps line
- end0 NIC drops & errors (H2 should stay flat)
- Conntrack utilisation (H3 should stay << 0.01)
- EEE / LPI enter rate (H7 fallback; --collector.ethtool)
Datasource UID hardcoded to fflnugavx2h34c (Scaleway Cockpit metrics
data source for this project). Layout is AutoGridLayout with three
columns; panels flow into four rows. No template variables yet —
adding a node selector requires a v2-schema sample.
https://claude.ai/code/session_01BgToCb8eDGsrkyTddCtt9Z
- panel-11 (Sentinel: end0 TX drops & errors) mirrors panel-8 against
node_network_transmit_{drop,errs}_total. Layout puts panels 7, 8, 11
in the same row so throughput, RX errs, and TX errs sit visually
grouped under one NIC sentinel band.
- panel-8 renamed to "RX drops & errors" to pair cleanly with the new
TX panel.
- New `node` query variable (multi, includeAll) using
label_values(node) so the dashboard can be filtered per-node. Every
panel query now selects on node=~"$node" so the variable actually
scopes results.
https://claude.ai/code/session_01BgToCb8eDGsrkyTddCtt9Z
Two reasons panels in the dashboard were empty:

1. Dashboard query typo. The softirq panel queried node_softirq_total
   (singular) but node_exporter exposes node_softirqs_total (plural).
2. node_exporter's --collector.netstat.fields default regex excludes
   TCPLostRetransmit and TCPSpuriousRTOs (and a number of other
   TcpExt_* fields). The collector reads them from /proc/net/netstat
   but drops them before exposing. Setting the filter to ^.*$ exposes
   the full set; the cardinality bump on a runner is a few dozen series
   per node, negligible.

The remaining two empty panels (NET_RX softirq cross-check, EEE/LPI)
are expected to start populating once a runner is reprovisioned with
this scw.py. This is verifiable on the node via:

  curl -s 127.0.0.1:9100/metrics | grep -E '^node_(softirqs|interrupts|ethtool)_'

https://claude.ai/code/session_01BgToCb8eDGsrkyTddCtt9Z
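The same on-node check can be scripted. This is a sketch only: the function names are made up, and it assumes node_exporter is listening on 127.0.0.1:9100 as in the curl command.

```python
import urllib.request

FAMILIES = ["softirqs", "interrupts", "ethtool"]


def exposed_families(metrics_text: str, families: list) -> dict:
    """Report which node_<family>_ metric prefixes appear in a scrape."""
    found = {family: False for family in families}
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comments, as the anchored grep does
        for family in families:
            if line.startswith(f"node_{family}_"):
                found[family] = True
    return found


def fetch_metrics(url: str = "http://127.0.0.1:9100/metrics") -> str:
    """Fetch the raw exposition text from the local node_exporter."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    for family, present in exposed_families(fetch_metrics(), FAMILIES).items():
        print(f"{family}: {'present' if present else 'MISSING'}")
```

A MISSING line points at a collector that was not enabled on that runner, rather than at a dashboard query problem.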
The light_dwmac_eth driver names the counter
irq_tx_path_in_lpi_mode_n, but st_gmac (the driver on the runner that
exported the metrics) names it irq_transmitted_path_in_lpi_mode_n. The
H7 sentinel panel was querying the light_dwmac form and getting no
data. Use the st_gmac form, which is the form actually exposed on the
fleet.
https://claude.ai/code/session_01BgToCb8eDGsrkyTddCtt9Z
Two unrelated query fixes against current node_exporter (1.11.1):

1. node_exporter renamed the softirqs metric to
   node_softirqs_functions_total. The H10 NET_RX panel was querying the
   old node_softirqs_total. (Raw data already confirms H10: CPU0 has
   1.55M NET_RX softirqs vs ~5K on each other CPU.)
2. The interrupts collector exposes the device name under the `devices`
   label, not `type` (which is the IRQ number). Filter end0 IRQs via
   devices=~".*end0.*"; legend now shows cpu, dev, and irq for clarity.

https://claude.ai/code/session_01BgToCb8eDGsrkyTddCtt9Z
Six rows:

  H1   Bad Fastly POP / IPv6 path            3 new panels (12-14)
  H5   Scaleway WAN egress contention        2 new panels (15-16)
  H7   EEE / LPI micro-stalls (fallback)     panel 10
  H9   TCP-stack health                      panels 4-6
  H10  Single-core CPU0 softirq saturation   panels 1-3
  Sentinels (H2/H3, refuted)                 panels 7, 8, 11, 9

H2 (NIC errors), H3 (conntrack), H4 (PMTU), H6 (ASN throttle), H8
(in-host contention) don't get their own rows: H2/H3 are sentinels,
H4/H6/H8 have no on-host Prometheus metric (live capture or off-host
probe only).

H1/H5 panels read raw_github_probe_seconds, _bytes_per_second (faceted
by remote_ip/target) and runner_dns_resolved_ip; they populate once
raw_github_probe.py and dns_probe.py are running on the node, otherwise
the panels are empty placeholders waiting for the probe data.

Layout uses RowsLayout containing one RowsLayoutRow per hypothesis,
each wrapping an AutoGridLayout with maxColumnCount=3; the schema
matches the v2 sample exported by Scaleway-managed Grafana.
https://claude.ai/code/session_01BgToCb8eDGsrkyTddCtt9Z
Mirrors the existing NET_RX softirq panel against
node_softirqs_functions_total{type="NET_TX"}. Slots into the H10 row
between NET_RX softirq and end0 hard IRQs so RX/TX softirq pressure
sit side-by-side, with the hard-IRQ panel on the next visual row.
https://claude.ai/code/session_01BgToCb8eDGsrkyTddCtt9Z
The probes were embedded as bash heredocs inside SETUP_SCRIPT; move the
canonical sources to scripts/probes/raw_github_probe.py and
scripts/probes/dns_probe.py and have run_setup read and substitute them
in. Behaviour identical; SETUP_SCRIPT shrinks by ~140 lines.
https://claude.ai/code/session_01BgToCb8eDGsrkyTddCtt9Z
d5c3e74 to
c8d10ef
Compare
H9: Spurious-RTO ratio (panel 18)
TCPSpuriousRTOs / TCPTimeouts. ~1.0 means most RTOs are spurious
(network was fine, kernel was impatient — rto_min tuning); ~0
means genuine packet loss.
H9: Fast recovery vs RTO (panel 19)
TCPSackRecovery + TCPRenoRecovery (cheap, cwnd preserved) vs
TCPTimeouts (expensive, cwnd collapsed). When RTO dominates the
loss pattern is severe/bursty.
H10: softnet drops & NAPI squeezes per CPU (panel 20)
rate(node_softnet_dropped_total) and _times_squeezed_total. If
times_squeezed > 0 on the CPU that handles end0 IRQs, the NAPI
poll ran out of budget with packets still queued, so the NIC ring
can overflow and drop frames before TCP sees them. This connects
H10 (CPU0 softirq saturation) to H9 (downstream TCP timeouts).
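The two H9 ratios above reduce to simple arithmetic on counter deltas over a window. A sketch, with hypothetical function names; the inputs are per-window increases of the TcpExt counters, not the dashboard's actual queries.

```python
from typing import Optional


def spurious_rto_ratio(spurious_rtos: float, timeouts: float) -> Optional[float]:
    """TCPSpuriousRTOs / TCPTimeouts; None when no RTO fired in the window."""
    if timeouts <= 0:
        return None  # avoid dividing by zero on a quiet window
    return spurious_rtos / timeouts


def rto_share(timeouts: float, sack_recoveries: float,
              reno_recoveries: float) -> Optional[float]:
    """Fraction of loss-recovery events handled by RTO rather than fast recovery."""
    total = timeouts + sack_recoveries + reno_recoveries
    if total <= 0:
        return None
    return timeouts / total


if __name__ == "__main__":
    # Example window: 40 RTOs, 36 of them spurious, 10 fast recoveries.
    print(spurious_rto_ratio(36, 40))  # close to 1.0: RTOs mostly spurious
    print(rto_share(40, 8, 2))         # high share: RTO dominates recovery
```

Returning None for an empty window keeps quiet periods from showing up as a misleading 0.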
https://claude.ai/code/session_01BgToCb8eDGsrkyTddCtt9Z
No description provided.