Lets just check #22

acooks · 2025-11-29T10:21:17Z

No description provided.

Applied automatic formatting from cargo fmt to address formatting issues identified during pre-checkin validation. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

… FD passing Replace fragile assumption that FD 4 is the stats pipe with secure environment variable-based FD passing mechanism. Changes: - Supervisor: Pass stats pipe FD via MCR_STATS_PIPE_FD env var instead of dup2 to FD 4 - Supervisor: Clear FD_CLOEXEC flag to allow FD inheritance across exec() - Worker: Read FD from environment variable instead of blindly using FD 4 - Worker: Validate FD with fcntl() before use - Worker: Gracefully handle missing FD (returns None for control plane) Security improvements: - No hardcoded FD assumptions - Explicit FD validation before use - Clear ownership and communication via environment variable - Graceful degradation when FD is not available 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

The test was failing because it assumed eth0 interface would exist, but workers need an IPv4 address from output interfaces to create UDP sockets. Fixed by: 1. Running in isolated NetworkNamespace to prevent test interference 2. Creating veth pair with unique UUID-based name to avoid collisions 3. Assigning IP 10.99.99.1/24 to veth interface 4. Using veth as output interface instead of hardcoded eth0 This ensures concurrent tests don't interfere and matches the pattern used by other integration tests (test_basic, test_topologies). The veth pair is automatically cleaned up when it goes out of scope. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

This commit addresses test infrastructure improvements to prevent bitrot and ensure all tests run regularly in CI: 1. GitHub Actions workflow (.github/workflows/rust.yml): - Fix branch name from 'master' to 'main' - Split test execution into unit tests and integration tests - Add sudo support for integration tests (requires CAP_NET_ADMIN) - Set 6-minute timeout for integration tests - Remove outdated '--skip privileged::' filter 2. Justfile test targets: - Consolidate integration test targets into single 'test-integration' - Remove 'test-integration-light' (all integration tests need root) - Update 'test-fast' to only run unit tests - Add backward compatibility alias 'test-integration-privileged' 3. Integration tests (tests/integration/rule_management.rs): - Remove #[ignore] from test_max_workers_spawning - Test now runs in normal test suite (previously required --ignored flag) - Update documentation to clarify EAGAIN errors are expected - Demonstrates supervisor resilience under resource pressure All integration tests now run in isolated network namespaces using unique veth interface names to prevent concurrent test interference. Tests verified passing locally before this commit. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

This establishes a multi-tier continuous testing strategy to prevent performance regressions and ensure comprehensive coverage: **Performance Workflow** (.github/workflows/performance.yml): - Runs on push to main branch - Quick performance test (100 packets, ~2 seconds) - Full performance suite with kernel tuning (~15 minutes) - Uploads performance metrics as artifacts for trend analysis **Nightly Workflow** (.github/workflows/nightly.yml): - Runs daily at 2 AM UTC (cron schedule) - Matrix strategy runs 4 topology tests in parallel: * baseline_50k * baseline_100k * chain_3hop * tree_fanout - Full integration stress test suite - Aggregates all metrics into summary report - Fails build if any topology test fails **Benefits:** - Catch performance regressions before merge - Continuous baseline measurements - Comprehensive topology coverage without slowing PRs - Historical performance data via artifacts - Manual trigger capability (workflow_dispatch) Together with existing PR validation (unit + integration tests), this provides complete continuous testing coverage. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Use dual veth pair topology with separate ingress (vhin/vnin) and egress (vhout/vnout) interfaces to avoid same-interface rejection. The relay correctly rejects rules where input and output interfaces are identical to prevent packet loops.

Supervisor was using fire-and-forget tokio::spawn() for shutdown commands, then immediately waiting for workers without ensuring commands were sent. Workers were killed with SIGTERM before receiving shutdown command, so final stats were never printed. Changes: - Supervisor: Await shutdown command sends with timeout, add 500ms grace period - Supervisor: Add process group isolation (.process_group(0)) for workers - Tests: Add graceful_cleanup_namespace() to send SIGTERM only to supervisors - Tests: Update all baseline tests to use graceful cleanup This fixes the stats reporting bug where tests showed ~90% delivery when actual delivery was 99.7% (final stats vs last periodic stats).

Add graceful_cleanup_unshare() function for tests using ephemeral namespaces (unshare --net). Update chain_3hop and tree_fanout tests to use graceful cleanup pattern. Changes: - common.sh: Add graceful_cleanup_unshare() for ephemeral namespaces - chain_3hop.sh: Use graceful cleanup with 3 MCR instances - tree_fanout.sh: Use graceful cleanup with 4 MCR instances This ensures all topology tests can capture final stats on shutdown, revealing true packet counts vs last periodic stats.

This commit fixes two related issues with stats reporting in baseline tests that prevented accurate final statistics from being captured: 1. Updated get_stats() in common.sh to look for "[STATS:(Ingress|Egress) FINAL]" pattern instead of "[STATS:(Ingress|Egress)]", matching the actual format produced by graceful shutdown. 2. Added explicit graceful shutdown before validation in all baseline tests (baseline_50k.sh, baseline_100k.sh, baseline_100k_60s.sh, baseline_150k_60s.sh). Previously, tests relied on EXIT trap for cleanup, which meant: - Traffic generation completed - Validation ran (reading incomplete periodic stats) - Test exited - EXIT trap triggered graceful shutdown and FINAL stats printing (too late!) Now tests explicitly: - Send SIGTERM to supervisors after traffic completes - Wait 2s for workers to print FINAL stats - Run validation (with accurate FINAL counts) - EXIT trap only cleans up namespace (no processes remaining) This ensures validation uses accurate final packet counts instead of stale periodic statistics from before shutdown began. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

The default socket buffer (~212KB) only provides ~1.5ms of burst tolerance at 100k pps, causing packet drops when io_uring processing latency exceeds this. Request 16MB buffer (actual size depends on net.core.rmem_max sysctl). With proper kernel tuning (rmem_max=16MB via scripts/setup_kernel_tuning.sh), this achieves 0% packet loss at 100k pps in testing. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Add configurable stats reporting interval via MCR_STATS_INTERVAL_MS env var. Default behavior (0) uses packet-count based reporting every 10k packets. Set to e.g. 100 for 100ms time-based intervals useful for debugging. Stats output now includes: - t_ms: elapsed time since start for time-series analysis - buf_exhaust: buffer pool exhaustion counter Also makes num_recv_buffers configurable via MCR_NUM_RECV_BUFFERS. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Only set IP_MULTICAST_IF when the destination is actually a multicast address (224.0.0.0/4). This allows the traffic generator to be used for unicast testing as well. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

- Add wait_for_bridge_forwarding() to ensure bridge ports reach forwarding state before sending traffic (avoids STP-related drops) - Use sudo -E to preserve environment variables like MCR_STATS_INTERVAL_MS - Fix MCR-2 core_id from 1 to 2 to avoid potential CPU affinity conflicts - Improve shutdown ordering: kill sender (MCR-1) first and wait for it to drain, then kill receiver (MCR-2) for reliable final stats capture - Make test parameters configurable via environment variables for sweeps 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Document plan for replacing AF_PACKET + io_uring recv() with PACKET_MMAP (TPACKET_V3) for zero-copy packet receive. This would eliminate the socket buffer bottleneck entirely. Covers: - Current vs target architecture - Implementation phases - Integration with unified loop - PACKET_FANOUT compatibility - Testing and rollout strategy 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

The send_queue was using Vec with remove(0) which is O(n), causing severe performance degradation during fanout scenarios where the queue grows large. This fix: - Changes send_queue from Vec to VecDeque for O(1) front removal - Uses while loop to drain queue completely each iteration (was single if) - Returns batch count from submit_send_batch() to detect SQ-full This enables 100k pps ingress → 300k pps egress with 3x fanout (was achieving only ~46% efficiency before due to queue backup). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Adds a new topology test that validates head-end replication with 50 outputs. This exercises the VecDeque-based send queue under high traffic amplification (10k input -> 500k output packets). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

The tree_fanout.sh test has been implemented and is actively passing in the test suite. Update documentation to reflect actual status. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

- Add section 3 for high_fanout_50.sh (1:50 fanout test) - Renumber planned tests (tree_converge, diamond, mesh) - Add new "Baseline Performance Tests" section documenting: - baseline_50k.sh (50k pps validation) - baseline_100k.sh (100k pps validation) - baseline_100k_60s.sh (60s profiling at 100k pps) - baseline_150k_60s.sh (60s profiling at 150k pps) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

The start_mcr function accepts an optional core_id parameter (5th arg) for CPU affinity, but this was missing from the documentation. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Remove dead code that was superseded by graceful_cleanup_unshare: - cleanup_mcr_processes() - used killall (host-wide, unsafe) - cleanup_sockets() - removed socket files manually - cleanup_all() - wrapper for above two All tests now use graceful_cleanup_unshare which sends SIGTERM to specific MCR PIDs, allowing graceful shutdown with final stats. Update README.md to document the correct cleanup function. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

The high_fanout_50.sh test validates 1:50 head-end replication, exercising the VecDeque-based send queue under high amplification. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

- Performance benchmarking mode: baseline_*_60s.sh tests - Long-duration stress tests: 60-second profiling tests - CI/CD integration: GitHub Actions workflows 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Eliminate ~70% code duplication across 4 baseline test scripts by creating a unified baseline_test.sh with configurable parameters. Changes: - Add baseline_test.sh with --rate, --packets, --profiling options - Convert baseline_50k.sh to thin wrapper (50k pps, 100k packets) - Convert baseline_100k.sh to thin wrapper (100k pps, 100k packets) - Convert baseline_100k_60s.sh to thin wrapper (100k pps, 6M packets, profiling) - Convert baseline_150k_60s.sh to thin wrapper (150k pps, 9M packets, profiling) Fixes addressed: - Consistent sequential shutdown (MCR-1 first, wait, then MCR-2) - Consistent CPU core assignment (MCR-1: core 0, MCR-2: core 1) - Unified validation thresholds (95% for <=100k pps, 80% for >100k pps) - Environment variable support for all test parameters Code reduction: 768 lines -> 250 lines (67% reduction) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Remove redundant test scripts: - baseline_50k.sh - redundant with baseline_100k.sh - baseline_100k_60s.sh - redundant with baseline_150k_60s.sh Update nightly CI workflow: - Remove baseline_50k from matrix (deleted) - Add high_fanout_50 to matrix - Add sustained-performance job (150k pps, 60s) for performance regression Update documentation to reflect simplified test suite. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Add 5 new topology tests covering previously untested scenarios: - tree_converge.sh: N:1 convergence with multiple traffic sources - multi_worker.sh: PACKET_FANOUT with multiple workers - fault_tolerance.sh: Graceful shutdown and SIGTERM handling - edge_cases.sh: Min/max packet sizes and buffer pool stress - dynamic_rules.sh: Runtime rule addition and concurrent rules All tests use network namespace isolation and include proper shellcheck compliance. Updated nightly CI workflow to run these tests and updated README documentation. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

- Add #[allow(dead_code)] to log_pipe and stats_pipe fields in Worker struct (reserved for future log/stats aggregation features) - Apply cargo fmt to unified_loop.rs and rule_management.rs 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

.github/workflows/nightly.yml

.github/workflows/performance.yml

…n permissions Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Reduced from 1315 lines to 245 lines by: - Removing 13 completed items (moved to "Archived" section) - Keeping 15 pending items organized by priority - Verifying all claims against actual code Corrections made during verification: - Log level control tests: Already fixed (was incorrectly listed as broken) - Periodic health checks: Already implemented at 250ms intervals - Updated line references to match current code Remaining work: 2 critical, 5 high, 5 medium, 3 low priority items 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

- Delete redundant ARCHITECTURE_EVALUATION.md (659 lines) - Add concise "Why PACKET_FANOUT_CPU" section to ARCHITECTURE.md - Fix MD040 errors: add language specifiers to code blocks - Fix MD036 errors: convert emphasis-as-heading to proper headings - Auto-fix formatting: add required blank lines Net reduction of 511 lines while preserving key architectural rationale. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

…check

acooks and others added 26 commits November 27, 2025 21:41

style: Apply cargo fmt formatting

e975dd3

Applied automatic formatting from cargo fmt to address formatting issues identified during pre-checkin validation. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

github-advanced-security bot found potential problems Nov 29, 2025

View reviewed changes

acooks and others added 3 commits November 29, 2025 20:31

Potential fix for code scanning alert no. 5: Workflow does not contai…

a0596d7

…n permissions Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Potential fix for code scanning alert no. 7: Workflow does not contai…

7b2cd3d

…n permissions Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

acooks and others added 2 commits November 29, 2025 20:54

Merge remote-tracking branch 'origin/lets-just-check' into lets-just-…

5b40a86

…check

acooks merged commit 22c1b08 into main Nov 29, 2025
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Lets just check #22

Lets just check #22

Uh oh!

acooks commented Nov 29, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Lets just check #22

Lets just check #22

Uh oh!

Conversation

acooks commented Nov 29, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants