
Claude/debug network performance etj9 w #20

Closed

luhenry wants to merge 10 commits into main from claude/debug-network-performance-Etj9W

Conversation

@luhenry luhenry commented May 10, 2026

No description provided.

luhenry added 9 commits May 10, 2026 20:30
These cover the two metrics in the debugging plan that no built-in
node_exporter collector exposes:

  raw_github_probe.py (5 min timer)
    Downloads a fixed artefact from raw.githubusercontent.com (Fastly,
    customer's actual path) and from speed.cloudflare.com (non-Fastly
    control). curl is shelled out so we get %{remote_ip} reflecting the
    IP libcurl actually connected to. Emits raw_github_probe_seconds,
    _bytes_per_second and _curl_exit_code labelled by target and
    remote_ip. When the Fastly target sags but the Cloudflare control
    stays flat, the issue is on the Scaleway-Fastly path (H9), not
    Scaleway WAN egress generally (H5).

  dns_probe.py (60 s timer)
    Resolves raw.githubusercontent.com via socket.getaddrinfo (same
    resolver libc/curl uses) and emits an info-style series
    runner_dns_resolved_ip{ip="..."}=1 plus a count. Lets us correlate
    slow windows with cache-region or POP flips (H1).

Both scripts run as the node_exporter user, write atomically (mkstemp +
os.replace) to the textfile collector dir, and self-recover from probe
failures by emitting curl_exit_code=99 instead of skipping the row.
Stdlib only — no extra apt packages needed.
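
A condensed sketch of the mechanics described above: the curl
shell-out for %{remote_ip}, the getaddrinfo resolution, and the
mkstemp + os.replace publish. The collector directory, timeouts, and
URL wiring are illustrative, not the actual probe sources.

  import os, socket, subprocess, tempfile, time

  TEXTFILE_DIR = "/var/lib/node_exporter/textfile"  # illustrative path

  def curl_probe(url):
      # Shell out to curl so %{remote_ip} reports the IP libcurl
      # actually connected to; 99 marks an internal probe failure.
      start = time.monotonic()
      try:
          out = subprocess.run(
              ["curl", "-sS", "-o", "/dev/null", "-w", "%{remote_ip}",
               "--max-time", "60", url],
              capture_output=True, text=True, timeout=90)
          code, remote_ip = out.returncode, out.stdout.strip()
      except Exception:
          code, remote_ip = 99, ""
      return time.monotonic() - start, code, remote_ip

  def resolve(host="raw.githubusercontent.com"):
      # Same resolver path libc/curl uses.
      return sorted({ai[4][0] for ai in socket.getaddrinfo(host, 443)})

  def write_textfile(lines, name):
      # mkstemp in the target dir + os.replace is an atomic publish;
      # node_exporter never sees a half-written file.
      fd, tmp = tempfile.mkstemp(dir=TEXTFILE_DIR)
      with os.fdopen(fd, "w") as f:
          f.write("\n".join(lines) + "\n")
      os.replace(tmp, os.path.join(TEXTFILE_DIR, name + ".prom"))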

https://claude.ai/code/session_01BgToCb8eDGsrkyTddCtt9Z
Ten panels organized by hypothesis (H10, H9, sentinels), in the
Grafana v2alpha dashboard schema (kind/spec wrappers, AutoGridLayout,
elements map keyed by panel-N) used by Scaleway's managed Grafana.

  H10 (single-queue NIC / CPU0 softirq saturation)
    - CPU softirq% by CPU                 (default cpu collector)
    - NET_RX softirq deliveries/sec       (--collector.softirqs)
    - end0 hard IRQs/sec by CPU           (--collector.interrupts)

  H9 (TCP-stack health, Scaleway-Fastly path proxy)
    - TCPTimeouts/sec
    - Retransmit ratio with 0.5% threshold
    - Lost retransmits & spurious RTOs

  Sentinels
    - end0 throughput vs 100 Mbps line
    - end0 NIC drops & errors             (H2 should stay flat)
    - Conntrack utilisation               (H3 should stay << 0.01)
    - EEE / LPI enter rate                (H7 fallback; --collector.ethtool)

Datasource UID hardcoded to fflnugavx2h34c (Scaleway Cockpit metrics
data source for this project). Layout is AutoGridLayout with three
columns; panels flow into four rows. No template variables yet —
adding a node selector needs a sample of how variables are expressed
in the v2 schema first.
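
Roughly, the skeleton being described, written here as a Python dict
standing in for the JSON; only the pieces named above (the kind/spec
wrappers, the elements map keyed panel-N, AutoGridLayout) come from
the commit, the rest is elided or assumed:

  dashboard = {
      "kind": "Dashboard",
      "spec": {
          "elements": {
              # one Panel element per chart, keyed panel-N
              "panel-1": {"kind": "Panel", "spec": {}},  # body elided
          },
          "layout": {
              "kind": "AutoGridLayout",
              "spec": {"maxColumnCount": 3},  # three columns, as above
          },
      },
  }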

https://claude.ai/code/session_01BgToCb8eDGsrkyTddCtt9Z
- panel-11 (Sentinel: end0 TX drops & errors) mirrors panel-8 against
  node_network_transmit_{drop,errs}_total. Layout puts panels 7, 8, 11
  in the same row so throughput, RX errs, and TX errs sit visually
  grouped under one NIC sentinel band.
- panel-8 renamed to "RX drops & errors" to pair cleanly with the new
  TX panel.
- New `node` query variable (multi, includeAll) using
  label_values(node) so the dashboard can be filtered per-node. Every
  panel query now selects on node=~"$node" so the variable actually
  scopes results.
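
Illustratively, a scoped panel query then looks like
rate(node_network_transmit_drop_total{node=~"$node", device="end0"}[5m]);
the device matcher and rate window are assumptions here, while
node=~"$node" is the selector this commit threads through every query.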

https://claude.ai/code/session_01BgToCb8eDGsrkyTddCtt9Z
Two reasons panels in the dashboard were empty:

1. Dashboard query typo. The softirq panel queried node_softirq_total
   (singular) but node_exporter exposes node_softirqs_total (plural).

2. node_exporter's --collector.netstat.fields default regex excludes
   TCPLostRetransmit and TCPSpuriousRTOs (and a number of other
   TcpExt_* fields). The collector reads them from /proc/net/netstat
   but drops them before exposing them. Setting the filter to ^.*$
   exposes the full set; the cardinality bump on a runner is a few
   dozen series per node, negligible.
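
   In flag terms that is --collector.netstat.fields='^.*$' on the
   node_exporter command line (shell quoting illustrative).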

The remaining two empty panels (NET_RX softirq cross-check, EEE/LPI)
are expected to start populating once a runner is reprovisioned with
this scw.py — verifiable on the node via
  curl -s 127.0.0.1:9100/metrics | grep -E '^node_(softirqs|interrupts|ethtool)_'

https://claude.ai/code/session_01BgToCb8eDGsrkyTddCtt9Z
The light_dwmac_eth driver names the counter
irq_tx_path_in_lpi_mode_n, but st_gmac (the driver on the runner that
exported the metrics) names it irq_transmitted_path_in_lpi_mode_n.
The H7 sentinel panel was querying the light_dwmac form and getting
no data. Use the st_gmac form, which is what the fleet actually
exposes.
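
With the ethtool collector's node_ethtool_ prefix, the panel ends up
querying something like node_ethtool_irq_transmitted_path_in_lpi_mode_n;
the prefix is standard for that collector, but the exact sanitised
name is an assumption here.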

https://claude.ai/code/session_01BgToCb8eDGsrkyTddCtt9Z
Two unrelated query fixes against current node_exporter (1.11.1):

1. node_exporter renamed the softirqs metric to
   node_softirqs_functions_total. The H10 NET_RX panel was querying
   the old node_softirqs_total. (Raw data already confirms H10:
   CPU0 has 1.55M NET_RX softirqs vs ~5K on each other CPU.)

2. The interrupts collector exposes the device name under the
   `devices` label, not `type` (which is the IRQ number). Filter
   end0 IRQs via devices=~".*end0.*"; legend now shows cpu, dev,
   and irq for clarity.
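
After both fixes the panels query something like
rate(node_softirqs_functions_total{type="NET_RX"}[5m]) faceted by cpu
and rate(node_interrupts_total{devices=~".*end0.*"}[5m]); the 5m
windows are illustrative.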

https://claude.ai/code/session_01BgToCb8eDGsrkyTddCtt9Z
Six rows:
  H1  Bad Fastly POP / IPv6 path                  3 new panels (12-14)
  H5  Scaleway WAN egress contention              2 new panels (15-16)
  H7  EEE / LPI micro-stalls (fallback)           panel 10
  H9  TCP-stack health                            panels 4-6
  H10 Single-core CPU0 softirq saturation         panels 1-3
  Sentinels (H2/H3, refuted)                      panels 7, 8, 11, 9

H2 (NIC errors), H3 (conntrack), H4 (PMTU), H6 (ASN throttle), H8
(in-host contention) don't get their own rows: H2/H3 are sentinels,
H4/H6/H8 have no on-host Prometheus metric (live capture or off-host
probe only).

H1/H5 panels read raw_github_probe_seconds and _bytes_per_second
(faceted by remote_ip/target) plus runner_dns_resolved_ip; they
populate once raw_github_probe.py and dns_probe.py are running on
the node, and until then sit as empty placeholders waiting for the
probe data.

Layout uses RowsLayout containing one RowsLayoutRow per hypothesis,
each wrapping an AutoGridLayout with maxColumnCount=3 — schema
matches the v2 sample exported by Scaleway-managed Grafana.
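
In the same hedged sketch form, the outer layout then becomes roughly
(the row/field nesting beyond the names the commit gives is an
assumption):

  layout = {
      "kind": "RowsLayout",
      "spec": {
          "rows": [
              {"kind": "RowsLayoutRow",
               "spec": {"title": "H10 Single-core CPU0 softirq saturation",
                        "layout": {"kind": "AutoGridLayout",
                                   "spec": {"maxColumnCount": 3}}}},
              # ...one RowsLayoutRow per hypothesis row
          ],
      },
  }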

https://claude.ai/code/session_01BgToCb8eDGsrkyTddCtt9Z
Mirrors the existing NET_RX softirq panel against
node_softirqs_functions_total{type="NET_TX"}. Slots into the H10 row
between NET_RX softirq and end0 hard IRQs so RX and TX softirq
pressure sit side by side, with the hard-IRQ panel on the next
visual row.

https://claude.ai/code/session_01BgToCb8eDGsrkyTddCtt9Z
The probes were embedded as bash heredocs inside SETUP_SCRIPT; move the
canonical sources to scripts/probes/raw_github_probe.py and
scripts/probes/dns_probe.py and have run_setup read and substitute
them in. Behaviour identical; SETUP_SCRIPT shrinks by ~140 lines.
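
A minimal sketch of the read-and-substitute step; run_setup and the
paths are from the commit, while the placeholder token is an
assumption:

  from pathlib import Path

  def inline_probes(setup_script: str) -> str:
      # Substitute the canonical probe sources into SETUP_SCRIPT so
      # the heredocs no longer live inline.
      for name in ("raw_github_probe.py", "dns_probe.py"):
          src = (Path("scripts/probes") / name).read_text()
          setup_script = setup_script.replace(
              f"@@{name}@@",  # assumed placeholder token
              src)
      return setup_script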

https://claude.ai/code/session_01BgToCb8eDGsrkyTddCtt9Z
@luhenry luhenry force-pushed the claude/debug-network-performance-Etj9W branch from d5c3e74 to c8d10ef on May 10, 2026 20:31
Three new panels:

  H9: Spurious-RTO ratio (panel 18)
    TCPSpuriousRTOs / TCPTimeouts (illustrative query after this
    list). ~1.0 means most RTOs are spurious (network was fine,
    kernel was impatient — rto_min tuning); ~0 means genuine
    packet loss.

  H9: Fast recovery vs RTO (panel 19)
    TCPSackRecovery + TCPRenoRecovery (cheap, cwnd preserved) vs
    TCPTimeouts (expensive, cwnd collapsed). When RTO dominates, the
    loss pattern is severe/bursty.

  H10: softnet drops & NAPI squeezes per CPU (panel 20)
    rate(node_softnet_dropped_total) and _times_squeezed_total. If
    times_squeezed > 0 on the CPU that handles end0 IRQs, the NAPI
    poll exhausted its budget and the NIC dropped at the ring buffer
    before TCP saw it, which connects H10 (CPU0 softirq saturation)
    to H9 (downstream TCP timeouts).
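
Illustratively, the panel-18 ratio divides
rate(node_netstat_TcpExt_TCPSpuriousRTOs[5m]) by
rate(node_netstat_TcpExt_TCPTimeouts[5m]) once the widened netstat
fields filter from the earlier commit exposes both counters; the
TcpExt naming follows node_exporter's convention, the windows are
assumptions.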

https://claude.ai/code/session_01BgToCb8eDGsrkyTddCtt9Z
@luhenry luhenry closed this May 10, 2026
@luhenry luhenry deleted the claude/debug-network-performance-Etj9W branch May 11, 2026 14:15