Commit c4d2467

JohnMcLear and claude authored
test(ci): OS-level sidecar watcher for the Windows silent ELIFECYCLE (#7846)
In-process diagnostics (diagnostics.ts heartbeat at 5 Hz + node-report snapshots on every beforeEach and heartbeat tick) merged in #7838 and #7842 reach a hard ceiling: during every captured death window the V8 main isolate is event-loop-starved for 200-400 ms before the process is externally terminated, so any timer-driven probe (heartbeat, setTimeout, --report-on-signal handler) never gets serviced and we have zero JS-visible state from the actual moment of death.

To capture state during the starvation window we need a probe whose own scheduling does not depend on the dying process's libuv event loop. This commit adds a tiny bash background loop to the Windows backend-test steps (both with- and without-plugins). Every 500 ms it appends:

- netstat.log: localhost TCP socket state. This surfaces TIME_WAIT / CLOSE_WAIT accumulation or ephemeral-port exhaustion that the in-process libuv handle list can't see (libuv only shows handles Node currently knows about; the kernel may hold many more sockets in disposal states).
- tasklist.log: node.exe process state from the Windows OS view (handle count, working set, CPU time), independent of whether V8 is responsive.

Both files land in $GITHUB_WORKSPACE/node-report/ which is already the artifact-upload target on failure, so they ride for free on existing infrastructure. The watcher is killed cleanly after `pnpm test` returns so it never holds the runner open.

On the next captured silent ELIFECYCLE we'll have, for the first time, a 500 ms-resolution external observation of TCP and process state across the death window.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
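Once a netstat.log has been captured, the TIME_WAIT / CLOSE_WAIT accumulation described above can be read off by counting sockets per state in each snapshot. The following is a minimal sketch, not part of this commit: the function name `summarize_netstat` is hypothetical, and it assumes the log has the `=== HH:MM:SS.mmm ===` headers the watcher writes and that the last column of each `netstat -an` TCP row is the connection state.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the commit): count sockets per TCP state for
# every snapshot in a netstat.log written by the watcher loop.
summarize_netstat() {
  awk '
    { sub(/\r$/, "") }                    # tolerate CRLF line endings from Windows tools
    /^=== / { ts = $2; next }             # snapshot header: "=== HH:MM:SS.mmm ==="
    $1 == "TCP" { counts[ts " " $NF]++ }  # assumed: last column is the TCP state
    END { for (k in counts) print k, counts[k] }
  ' "$1" | LC_ALL=C sort
}

# Usage (path is an assumption based on the artifact layout):
#   summarize_netstat node-report/netstat.log
```

Each output line has the form `timestamp state count`, so a steadily growing TIME_WAIT count ahead of the death window would stand out when scanning the result.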
1 parent 118d48c commit c4d2467

1 file changed

Lines changed: 74 additions & 0 deletions

.github/workflows/backend-tests.yml
@@ -222,11 +222,48 @@ jobs:
           NODE_OPTIONS: "--report-on-fatalerror --report-uncaught-exception --report-on-signal --report-compact --report-directory=${{ github.workspace }}/node-report"
         run: |
           mkdir -p "${{ github.workspace }}/node-report"
+          OUT="${{ github.workspace }}/node-report"
+          # Out-of-process OS-level watcher for the silent-ELIFECYCLE flake.
+          # In-process diagnostics (diagnostics.ts heartbeat + node-report
+          # snapshots) showed that during the death window the V8 main
+          # isolate is starved — heartbeat stops firing entirely, then the
+          # process is externally terminated, bypassing all JS handlers and
+          # Node's --report-on-fatalerror. To capture state during that
+          # starvation we need a process that doesn't depend on the dying
+          # process's event loop. A bash background loop polling Windows
+          # OS state every 500 ms gives us that:
+          # - netstat.log: localhost TCP socket states over time
+          # (TIME_WAIT/CLOSE_WAIT accumulation, handle exhaustion)
+          # - tasklist.log: node.exe process handle count, working set,
+          # CPU time — captured by the OS independent of V8.
+          # Both logs are appended to node-report/ which already gets
+          # uploaded as an artifact on failure.
+          (
+            while true; do
+              ts=$(date '+%H:%M:%S.%3N')
+              {
+                echo "=== $ts ==="
+                netstat -an 2>/dev/null | grep -E "TCP\s+(127\.0\.0\.1|\[::1\])" || true
+              } >> "$OUT/netstat.log"
+              {
+                echo "=== $ts ==="
+                tasklist /v /fi "imagename eq node.exe" /fo csv 2>/dev/null || true
+              } >> "$OUT/tasklist.log"
+              sleep 0.5
+            done
+          ) &
+          WATCHER_PID=$!
           # --exit forces process.exit(failures) after the suite completes,
           # closing the post-suite event-loop drain window where Windows +
           # Node 24 hard-kills the process. Scoped to Windows so Linux/local
           # runs still surface real handle leaks via natural drain.
+          set +e
           pnpm test -- --exit
+          EXIT=$?
+          set -e
+          kill "$WATCHER_PID" 2>/dev/null || true
+          wait "$WATCHER_PID" 2>/dev/null || true
+          exit $EXIT
       - name: Upload Node diagnostic reports on failure
         if: ${{ failure() }}
         uses: actions/upload-artifact@v7
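The hunk above follows a start-watcher, run-tests, preserve-exit-code, stop-watcher sequence. As a minimal sketch of that pattern outside the workflow (not the commit's code), the hypothetical function `run_with_watcher` below wraps an arbitrary command. It swaps the `set +e` / `set -e` bracket for `|| status=$?`, which is one alternative way to capture a failing status without tripping errexit; the sampler body is reduced to a timestamp line so the sketch runs on any system with bash.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the commit): run a command while a background
# sampler appends to a log, then stop the sampler and return the command's status.
run_with_watcher() {
  local log=$1
  shift
  (
    while true; do
      # Stand-in for the netstat/tasklist sampling done in the workflow.
      echo "=== $(date '+%H:%M:%S') ===" >> "$log"
      sleep 0.5
    done
  ) &
  local watcher_pid=$!
  local status=0
  "$@" || status=$?                        # keep the failure status, even under `set -e`
  kill "$watcher_pid" 2>/dev/null || true  # stop the sampler subshell
  wait "$watcher_pid" 2>/dev/null || true  # reap it so nothing is left pending in this shell
  return "$status"
}

# Usage (log path and command are illustrative):
#   run_with_watcher /tmp/watch.log pnpm test -- --exit
```

Returning the saved status is what lets a later `if: ${{ failure() }}` step still see the test failure after the cleanup commands have succeeded.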
@@ -319,11 +356,48 @@ jobs:
           NODE_OPTIONS: "--report-on-fatalerror --report-uncaught-exception --report-on-signal --report-compact --report-directory=${{ github.workspace }}/node-report"
         run: |
           mkdir -p "${{ github.workspace }}/node-report"
+          OUT="${{ github.workspace }}/node-report"
+          # Out-of-process OS-level watcher for the silent-ELIFECYCLE flake.
+          # In-process diagnostics (diagnostics.ts heartbeat + node-report
+          # snapshots) showed that during the death window the V8 main
+          # isolate is starved — heartbeat stops firing entirely, then the
+          # process is externally terminated, bypassing all JS handlers and
+          # Node's --report-on-fatalerror. To capture state during that
+          # starvation we need a process that doesn't depend on the dying
+          # process's event loop. A bash background loop polling Windows
+          # OS state every 500 ms gives us that:
+          # - netstat.log: localhost TCP socket states over time
+          # (TIME_WAIT/CLOSE_WAIT accumulation, handle exhaustion)
+          # - tasklist.log: node.exe process handle count, working set,
+          # CPU time — captured by the OS independent of V8.
+          # Both logs are appended to node-report/ which already gets
+          # uploaded as an artifact on failure.
+          (
+            while true; do
+              ts=$(date '+%H:%M:%S.%3N')
+              {
+                echo "=== $ts ==="
+                netstat -an 2>/dev/null | grep -E "TCP\s+(127\.0\.0\.1|\[::1\])" || true
+              } >> "$OUT/netstat.log"
+              {
+                echo "=== $ts ==="
+                tasklist /v /fi "imagename eq node.exe" /fo csv 2>/dev/null || true
+              } >> "$OUT/tasklist.log"
+              sleep 0.5
+            done
+          ) &
+          WATCHER_PID=$!
           # --exit forces process.exit(failures) after the suite completes,
           # closing the post-suite event-loop drain window where Windows +
           # Node 24 hard-kills the process. Scoped to Windows so Linux/local
           # runs still surface real handle leaks via natural drain.
+          set +e
           pnpm test -- --exit
+          EXIT=$?
+          set -e
+          kill "$WATCHER_PID" 2>/dev/null || true
+          wait "$WATCHER_PID" 2>/dev/null || true
+          exit $EXIT
       - name: Upload Node diagnostic reports on failure
         if: ${{ failure() }}
         uses: actions/upload-artifact@v7
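For reading a captured tasklist.log, a per-process time series can be pulled out of the CSV snapshots. This is a minimal sketch with a hypothetical function name, `tasklist_series`. It assumes the English-locale `tasklist /v /fo csv` column order (Image Name, PID, Session Name, Session#, Mem Usage, Status, User Name, CPU Time, Window Title), in which Mem Usage is field 5 and CPU Time is field 8. That column list is an assumption worth checking against a real artifact; as far as I know it does not include a handle-count column.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the commit): print "timestamp<TAB>pid<TAB>mem<TAB>cpu"
# for every node.exe row in a tasklist.log written by the watcher loop.
tasklist_series() {
  # Splitting on the literal "," between quoted fields keeps values such as
  # "123,456 K" intact, which a plain comma split would break apart.
  awk -F'","' -v OFS='\t' '
    { sub(/\r$/, "") }                               # tolerate CRLF line endings
    /^=== / { split($0, h, " "); ts = h[2]; next }   # snapshot header from the watcher
    /^"node\.exe"/ { print ts, $2, $5, $8 }          # assumed columns: PID, Mem Usage, CPU Time
  ' "$1"
}

# Usage (path is an assumption based on the artifact layout):
#   tasklist_series node-report/tasklist.log
```

If a captured tasklist.log turns out to contain only the `=== ... ===` headers, one thing to rule out is Git Bash rewriting the `/v`, `/fi` and `/fo` switches as paths; since stderr is discarded in the loop, such a failure would be silent. Doubling the slashes or setting `MSYS_NO_PATHCONV=1` for that command is the usual workaround, though whether it applies depends on the shell the step actually runs under.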
