|
1 | | -# Network Stress Test |
| 1 | +# Broadcast Network Stress Test Node |
2 | 2 |
|
3 | | -## Setup and Run Stress Test |
| 3 | +A comprehensive network stress testing tool for the Apollo network that tests P2P communication, measures performance metrics, and validates network behavior under various load patterns and conditions. |
4 | 4 |
|
5 | | -1. **Create Remote Engines** |
| 5 | +## Overview |
6 | 6 |
|
7 | | - Create 5 gcloud VM instances. Make sure to have the necessary RAM and disk space. Each instance should be named in the following pattern: |
| 7 | +The broadcast network stress test node is designed to stress test the P2P communication layer of the Apollo network. It creates a network of nodes with configurable broadcasting patterns, measuring latency, throughput, message ordering, and overall network performance. The tool supports both local testing (using the provided Python scripts) and distributed deployment via Kubernetes with optional network throttling. |
8 | 8 |
|
9 | | - ``` |
10 | | - <instance-name>-0, ... ,<instance-name>-4 |
11 | | - ``` |
| 9 | +## Features |
12 | 10 |
|
13 | | -2. **Set Bootstrap Node** |
| 11 | +- **Multiple Broadcasting Modes**: Supports different message broadcasting patterns (all nodes, single broadcaster, round-robin) |
| 12 | +- **Advanced Performance Metrics**: Measures message latency, throughput, and delivery rates, with ordering and duplicate detection
| 13 | +- **Message Ordering Analysis**: Tracks out-of-order messages, missing messages, and duplicates |
| 14 | +- **Prometheus Integration**: Exports detailed, labeled metrics for monitoring and analysis
| 15 | +- **Network Throttling**: Supports bandwidth limiting and latency injection to simulate realistic network conditions
| 16 | +- **Configurable Parameters**: Customizable message sizes, send intervals, buffer sizes, and test duration |
| 17 | +- **Multi-Node Support**: Can run multiple coordinated nodes with different broadcasting patterns |
| 18 | +- **Kubernetes Deployment**: Includes YAML templates for cluster deployment with traffic shaping |
| 19 | +- **Deterministic Peer IDs**: Generates consistent peer identities for reproducible tests |
14 | 20 |
|
15 | | - Find the internal IP of your bootstrap node in the VM instances chart on google cloud console. Paste it into the test_config.json file into the bootstrap_peer_multaddr value instead of its placeholder. |
| 21 | +## Building |
16 | 22 |
|
17 | | -3. **Install Rust and clone repository** |
| 23 | +Build the stress test node binary: |
18 | 24 |
|
19 | | - For all 5 instances run: |
| 25 | +```bash |
| 26 | +cargo build --release --bin broadcast_network_stress_test_node |
| 27 | +``` |
20 | 28 |
|
21 | | - ``` |
22 | | - gcloud compute ssh <instance-name>-0 --project <project-name> -- 'cd <path-to-repo> && sudo apt install -y git unzip clang && curl https://sh.rustup.rs -sSf | sh -s -- -y && source "$HOME/.cargo/env" && git clone https://github.com/starkware-libs/sequencer.git; cd sequencer && sudo scripts/dependencies.sh cargo build --release -p apollo_network --bin network_stress_test' |
23 | | - ``` |
| 29 | +## Command Line Arguments |
24 | 30 |
|
25 | | -4. **Run test** |
| 31 | +| Argument | Description | Default | Environment Variable | |
| 32 | +|----------|-------------|---------|---------------------| |
| 33 | +| `--id` | Node ID for identification and metrics | Required | `ID` | |
| 34 | +| `--num-nodes` | Total number of nodes in the network | 3 | `NUM_NODES` | |
| 35 | +| `--metric-port` | Prometheus metrics server port | 2000 | `METRIC_PORT` | |
| 36 | +| `--p2p-port` | P2P network port | 10000 | `P2P_PORT` | |
| 37 | +| `--bootstrap` | Bootstrap peer addresses (comma-separated) | None | `BOOTSTRAP` | |
| 38 | +| `--verbosity` | Log verbosity (0-5: None, ERROR, WARN, INFO, DEBUG, TRACE) | 2 | `VERBOSITY` | |
| 39 | +| `--buffer-size` | Broadcast topic buffer size | 10000 | `BUFFER_SIZE` | |
| 40 | +| `--message-size-bytes` | Message payload size in bytes | 1024 | `MESSAGE_SIZE_BYTES` | |
| 41 | +| `--heartbeat-millis` | Interval between messages (milliseconds) | 1 | `HEARTBEAT_MILLIS` | |
| 42 | +| `--mode` | Broadcasting mode: `all`, `one`, or `rr` | `all` | `MODE` | |
| 43 | +| `--broadcaster` | ID of the broadcasting node (`one` mode only) | 1 | `BROADCASTER` |
| 44 | +| `--round-duration-seconds` | Broadcast duration per node in `rr` mode (seconds) | 3 | `ROUND_DURATION_SECONDS` |
26 | 45 |
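| | +Every flag can also be supplied through its environment variable (listed above), which is convenient for container deployments. The `--verbosity` scale maps onto standard log levels; a plausible sketch of that mapping (the exact wiring in the binary is an assumption):
| | +
| | +```rust
| | +// Hedged sketch: --verbosity 0-5 -> log level filter, per the table above.
| | +fn level_filter(verbosity: u8) -> &'static str {
| | +    match verbosity {
| | +        0 => "off",
| | +        1 => "error",
| | +        2 => "warn",
| | +        3 => "info",
| | +        4 => "debug",
| | +        _ => "trace", // 5 and above
| | +    }
| | +}
| | +```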
|
27 | | - ``` |
28 | | - PROJECT_ID=<project-name> BASE_INSTANCE_NAME=<instance-name> ZONE=<zone> ./run_broadcast_stress_test.sh |
29 | | - ``` |
| 46 | +## Broadcasting Modes |
30 | 47 |
|
31 | | -5. **Results** |
| 48 | +### All Broadcast (`all`) |
| 49 | +All nodes continuously broadcast messages simultaneously. Best for testing network capacity and concurrent message handling. |
32 | 50 |
|
33 | | - Results are retrieved from VM instances and saved to /output.csv. You can change the default path by adjusting the config file. |
| 51 | +### Single Broadcaster (`one`) |
| 52 | +Only the node specified by `--broadcaster` sends messages, while others act as receivers. Ideal for testing message propagation and network topology. |
34 | 53 |
|
35 | | -## Pull repo updates to virtual machines |
| 54 | +### Round Robin (`rr`) |
| 55 | +Nodes take turns broadcasting in sequential order based on their ID. Each node broadcasts for `--round-duration-seconds` before passing to the next. Useful for testing network behavior under changing load patterns. |
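| | +
| | +As a rough illustration, the active broadcaster in `rr` mode can be derived from elapsed time alone; this is a hedged sketch of the idea, not the exact logic in `main.rs`:
| | +
| | +```rust
| | +/// Which node's turn is it to broadcast? (assumed round-robin logic)
| | +fn current_broadcaster(elapsed_secs: u64, round_duration_secs: u64, num_nodes: u64) -> u64 {
| | +    (elapsed_secs / round_duration_secs) % num_nodes
| | +}
| | +
| | +fn main() {
| | +    // With 3 nodes and the default 3-second rounds, node 1 owns seconds 3..6.
| | +    assert_eq!(current_broadcaster(4, 3, 3), 1);
| | +}
| | +```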
36 | 56 |
|
37 | | -1. **Run** |
| 57 | +## Running Locally |
38 | 58 |
|
39 | | - ``` |
40 | | - PROJECT_ID=<project-name> BASE_INSTANCE_NAME=<instance-name> ZONE=<zone> ./pull_stress_test.sh |
41 | | - ``` |
| 59 | +### Recommended: Multi-Node Network using Local Script |
| 60 | + |
| 61 | +The recommended way to run locally is via the `local.py` script. Navigate to the run directory and launch it:
| 62 | + |
| 63 | +```bash |
| 64 | +cd crates/apollo_network/src/bin/broadcast_network_stress_test_node/run |
| 65 | +python local.py --num-nodes 3 --verbosity 3 --mode rr |
| 66 | +``` |
| 67 | + |
| 68 | +This will: |
| 69 | +- Compile the binary if needed |
| 70 | +- Start 3 nodes with sequential ports (10000, 10001, 10002) |
| 71 | +- Automatically configure bootstrap peers for all nodes |
| 72 | +- Launch Prometheus in Docker for metrics collection |
| 73 | +- Provide a web interface at http://localhost:9090 |
| 74 | + |
| 75 | +### Manual Single Node (Advanced) |
| 76 | + |
| 77 | +For direct binary testing (not recommended for most use cases): |
| 78 | + |
| 79 | +```bash |
| 80 | +./target/release/broadcast_network_stress_test_node \ |
| 81 | + --id 0 \ |
| 82 | + --metric-port 2000 \ |
| 83 | + --p2p-port 10000 \ |
| 84 | + --verbosity 3 \ |
| 85 | + --mode all |
| 86 | +``` |
| 87 | + |
| 88 | +### Advanced Local Testing |
| 89 | + |
| 90 | +All commands should be run from the run directory: |
| 91 | + |
| 92 | +```bash |
| 93 | +cd crates/apollo_network/src/bin/broadcast_network_stress_test_node/run |
| 94 | + |
| 95 | +# Test round-robin mode with custom timing |
| 96 | +python local.py --num-nodes 5 --mode rr --round-duration-seconds 10 --heartbeat-millis 100 |
| 97 | + |
| 98 | +# Test single broadcaster mode |
| 99 | +python local.py --num-nodes 3 --mode one --broadcaster 0 --message-size-bytes 4096 |
| 100 | +``` |
| 101 | + |
| 102 | +## Kubernetes Deployment |
| 103 | + |
| 104 | +### Prerequisites |
| 105 | + |
| 106 | +- Kubernetes cluster access |
| 107 | +- Docker registry access |
| 108 | +- kubectl configured |
| 109 | + |
| 110 | +### Deploy to Cluster |
| 111 | + |
| 112 | +```bash |
| 113 | +cd crates/apollo_network/src/bin/broadcast_network_stress_test_node/run |
| 114 | +python cluster_start.py --num-nodes 5 --latency 50 --throughput 1000 --mode rr |
| 115 | +``` |
| 116 | + |
| 117 | +This will: |
| 118 | +- Build and push a Docker image |
| 119 | +- Create Kubernetes StatefulSet with 5 nodes |
| 120 | +- Apply network throttling (50ms latency, 1000 KB/s throughput) |
| 121 | +- Deploy to a timestamped namespace |
| 122 | + |
| 123 | +### Access Prometheus |
| 124 | + |
| 125 | +```bash |
| 126 | +python cluster_port_forward_prometheus.py |
| 127 | +``` |
| 128 | + |
| 129 | +Then visit http://localhost:9090 for metrics visualization. |
| 130 | + |
| 131 | +### Cleanup |
| 132 | + |
| 133 | +```bash |
| 134 | +python cluster_stop.py |
| 135 | +``` |
| 136 | + |
| 137 | +## Network Throttling |
| 138 | + |
| 139 | +The Docker deployment supports network traffic shaping to simulate realistic network conditions: |
| 140 | + |
| 141 | +- **Latency Injection**: Adds artificial delay to packets (via the `LATENCY` environment variable, in ms)
| 142 | +- **Throughput Limiting**: Caps bandwidth to test under constrained conditions (via the `THROUGHPUT` environment variable, in KB/s)
| 143 | + |
| 144 | +The entrypoint script uses Linux traffic control (`tc`) with HTB (Hierarchical Token Bucket) for bandwidth limiting and NetEm for latency simulation. |
| 145 | + |
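| | +For reference, the shaping roughly corresponds to the following `tc` invocations, sketched here in Rust for illustration (the exact handles and defaults used in `entrypoint.sh` are assumptions):
| | +
| | +```rust
| | +use std::process::Command;
| | +
| | +// Hedged sketch of the tc setup described above: an HTB class caps the rate
| | +// (tc's `kbps` unit is kilobytes/sec, matching THROUGHPUT's KB/s), and a
| | +// netem child qdisc adds a constant delay to packets in that class.
| | +fn throttle(iface: &str, latency_ms: u32, throughput_kb_s: u32) -> std::io::Result<()> {
| | +    run(&["qdisc", "add", "dev", iface, "root", "handle", "1:", "htb", "default", "10"])?;
| | +    run(&["class", "add", "dev", iface, "parent", "1:", "classid", "1:10",
| | +          "htb", "rate", &format!("{throughput_kb_s}kbps")])?;
| | +    run(&["qdisc", "add", "dev", iface, "parent", "1:10", "handle", "10:",
| | +          "netem", "delay", &format!("{latency_ms}ms")])?;
| | +    Ok(())
| | +}
| | +
| | +fn run(args: &[&str]) -> std::io::Result<()> {
| | +    Command::new("tc").args(args).status().map(|_| ())
| | +}
| | +```
| | +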
| 146 | +## Metrics |
| 147 | + |
| 148 | +The tool exports the following Prometheus metrics:
| 149 | + |
| 150 | +### Message Flow Metrics |
| 151 | +- `messages_sent_total`: Total messages sent by this node |
| 152 | +- `messages_received_total`: Total messages received (with `sender_id` label) |
| 153 | +- `bytes_received_total`: Total bytes received across all messages |
| 154 | + |
| 155 | +### Performance Metrics |
| 156 | +- `message_delay_seconds`: End-to-end message latency histogram (with `sender_id` label) |
| 157 | + |
| 158 | +### Message Ordering Metrics |
| 159 | +- `messages_out_of_order_total`: Messages received out of sequence (with `sender_id` label) |
| 160 | +- `messages_missing_total`: Messages that appear to be missing (with `sender_id` label) |
| 161 | +- `messages_duplicate_total`: Duplicate messages detected (with `sender_id` label) |
| 162 | +- `messages_missing_retrieved_total`: Previously missing messages that arrived late (with `sender_id` label) |
| 163 | + |
| 164 | +All metrics carry labels for per-sender analysis, enabling detailed study of network behavior.
| 165 | + |
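| | +The ordering metrics can all be derived from the per-sender message index carried in each message. One plausible bookkeeping scheme (the actual logic in `main.rs` may differ) is sketched below:
| | +
| | +```rust
| | +use std::collections::HashSet;
| | +
| | +/// Per-sender ordering state: an assumed mapping from arrivals to the
| | +/// ordering metrics above, for exposition only.
| | +#[derive(Default)]
| | +struct OrderTracker {
| | +    next_expected: u64,
| | +    missing: HashSet<u64>,
| | +}
| | +
| | +enum Event {
| | +    InOrder,
| | +    SkippedAhead { newly_missing: u64 }, // feeds messages_missing_total
| | +    MissingRetrieved,                    // feeds messages_missing_retrieved_total
| | +    Duplicate,                           // feeds messages_duplicate_total
| | +}
| | +
| | +impl OrderTracker {
| | +    fn on_message(&mut self, index: u64) -> Event {
| | +        if index == self.next_expected {
| | +            self.next_expected += 1;
| | +            Event::InOrder
| | +        } else if index > self.next_expected {
| | +            // Every index we jumped over is provisionally missing.
| | +            let newly_missing = index - self.next_expected;
| | +            for i in self.next_expected..index {
| | +                self.missing.insert(i);
| | +            }
| | +            self.next_expected = index + 1;
| | +            Event::SkippedAhead { newly_missing }
| | +        } else if self.missing.remove(&index) {
| | +            Event::MissingRetrieved // a gap filled late, i.e. out of order
| | +        } else {
| | +            Event::Duplicate // index already accounted for
| | +        }
| | +    }
| | +}
| | +```
| | +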
| 166 | +## Configuration |
| 167 | + |
| 168 | +### Message Structure |
| 169 | + |
| 170 | +Each stress test message contains: |
| 171 | +- **Sender ID**: Node identifier (8 bytes) |
| 172 | +- **Message Index**: Sequential message number from sender (8 bytes) |
| 173 | +- **Timestamp**: Send time as nanoseconds since UNIX epoch (16 bytes) |
| 174 | +- **Payload Length**: Size of variable payload (8 bytes) |
| 175 | +- **Payload**: Configurable data (remaining bytes) |
| 176 | + |
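| | +A hedged sketch of that layout (field order follows the list above; the exact encoding in `converters.rs`, e.g. endianness, is an assumption):
| | +
| | +```rust
| | +// Fixed 40-byte header (8 + 8 + 16 + 8) followed by the payload.
| | +fn encode(sender_id: u64, index: u64, ts_nanos: u128, payload: &[u8]) -> Vec<u8> {
| | +    let mut buf = Vec::with_capacity(40 + payload.len());
| | +    buf.extend_from_slice(&sender_id.to_be_bytes());              // 8 bytes
| | +    buf.extend_from_slice(&index.to_be_bytes());                  // 8 bytes
| | +    buf.extend_from_slice(&ts_nanos.to_be_bytes());               // 16 bytes
| | +    buf.extend_from_slice(&(payload.len() as u64).to_be_bytes()); // 8 bytes
| | +    buf.extend_from_slice(payload);                               // remaining bytes
| | +    buf
| | +}
| | +```
| | +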
| 177 | +### Network Topology |
| 178 | + |
| 179 | +- All nodes join the same gossipsub topic: `stress_test_topic` |
| 180 | +- Node 0 typically acts as the bootstrap peer for network discovery |
| 181 | +- Deterministic peer IDs based on node ID ensure consistent network formation |
| 182 | +- Secret keys are generated deterministically from node ID for reproducibility |
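| | +
| | +For instance, a deterministic secret key can be obtained by embedding the node ID in a fixed seed; this is a hypothetical illustration, not the tool's actual derivation:
| | +
| | +```rust
| | +// Hypothetical: fold the node id into a 32-byte secret-key seed so the
| | +// same id always yields the same peer identity.
| | +fn seed_from_id(id: u64) -> [u8; 32] {
| | +    let mut seed = [0u8; 32];
| | +    seed[..8].copy_from_slice(&id.to_be_bytes());
| | +    seed
| | +}
| | +```
| | +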
| 183 | + |
| 184 | +## Example Use Cases |
| 185 | + |
| 186 | +### Latency Testing |
| 187 | +```bash |
| 188 | +# Test with 100ms network latency |
| 189 | +python cluster_start.py --num-nodes 3 --latency 100 --message-size-bytes 512 --mode all |
| 190 | +``` |
| 191 | + |
| 192 | +### Throughput Testing |
| 193 | +```bash |
| 194 | +# Test with 500 KB/s bandwidth limit |
| 195 | +python cluster_start.py --num-nodes 5 --throughput 500 --heartbeat-millis 10 --mode rr |
| 196 | +``` |
| 197 | + |
| 198 | +### Large Message Testing |
| 199 | +```bash |
| 200 | +# Test with 64KB messages in single broadcaster mode (run from the run directory) |
| 201 | +cd crates/apollo_network/src/bin/broadcast_network_stress_test_node/run |
| 202 | +python local.py --num-nodes 3 --message-size-bytes 65536 --heartbeat-millis 100 --mode one |
| 203 | +``` |
| 204 | + |
| 205 | +### Network Resilience Testing |
| 206 | +```bash |
| 207 | +# Test round-robin with constrained network |
| 208 | +python cluster_start.py --num-nodes 4 --latency 200 --throughput 100 --mode rr --round-duration-seconds 30 |
| 209 | +``` |
| 210 | + |
| 211 | +## Development |
| 212 | + |
| 213 | +### File Structure |
| 214 | + |
| 215 | +- `main.rs`: Core stress test logic, broadcasting modes, and coordination |
| 216 | +- `converters.rs`: Message serialization/deserialization with ordering support |
| 217 | +- `converters_test.rs`: Unit tests for message conversion |
| 218 | +- `utils.rs`: Configuration utilities and helper functions |
| 219 | +- `run/`: Deployment scripts and configurations |
| 220 | + - `local.py`: Local multi-node testing with Prometheus |
| 221 | + - `cluster_start.py`: Kubernetes deployment with throttling |
| 222 | + - `cluster_stop.py`: Cleanup deployed resources |
| 223 | + - `cluster_port_forward_prometheus.py`: Prometheus access helper |
| 224 | + - `yaml_maker.py`: Kubernetes YAML generation |
| 225 | + - `args.py`: Shared argument parsing for Python scripts |
| 226 | + - `utils.py`: Common utility functions |
| 227 | + - `Dockerfile`: Container image with traffic shaping capabilities |
| 228 | + - `entrypoint.sh`: Container startup script with network throttling |
| 229 | + - Various Kubernetes YAML templates |
| 230 | + |
| 231 | +### Adding New Metrics |
| 232 | + |
| 233 | +1. Import metrics crate: `use metrics::{counter, histogram, gauge};` |
| 234 | +2. Add metric recording in message handlers or broadcasting logic |
| 235 | +3. Use appropriate labels for detailed analysis |
| 236 | +4. Update Prometheus configuration in deployment scripts if needed |
| 237 | + |
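| | +A hedged sketch of step 2, using the macro style of recent `metrics` crate versions (adjust to the version pinned in the workspace):
| | +
| | +```rust
| | +use metrics::{counter, histogram};
| | +
| | +// Record a received message with a per-sender label, mirroring the
| | +// existing messages_received_total / message_delay_seconds metrics.
| | +fn record_receive(sender_id: u64, delay_secs: f64) {
| | +    counter!("messages_received_total", "sender_id" => sender_id.to_string())
| | +        .increment(1);
| | +    histogram!("message_delay_seconds", "sender_id" => sender_id.to_string())
| | +        .record(delay_secs);
| | +}
| | +```
| | +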
| 238 | +### Adding New Broadcasting Modes |
| 239 | + |
| 240 | +1. Extend the `Mode` enum in `main.rs` |
| 241 | +2. Update the mode-specific logic in `send_stress_test_messages()` |
| 242 | +3. Add corresponding argument parsing in `args.py` |
| 243 | +4. Update documentation and examples |
| 244 | + |
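| | +The variant names below are assumptions based on the mode names used in this document; treat this as a shape sketch only:
| | +
| | +```rust
| | +enum Mode {
| | +    AllBroadcast, // `all`
| | +    OneBroadcast, // `one`
| | +    RoundRobin,   // `rr`
| | +    // A new variant added here must also be matched in
| | +    // send_stress_test_messages() and parsed in run/args.py.
| | +}
| | +```
| | +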
| 245 | +### Network Configuration |
| 246 | + |
| 247 | +Modify `NetworkConfig` parameters in `main.rs` for different P2P behaviors: |
| 248 | +- Connection limits and timeouts |
| 249 | +- Heartbeat intervals |
| 250 | +- Gossipsub parameters (mesh size, fanout, etc.) |
| 251 | +- Discovery mechanisms and protocols |
| 252 | + |
| 253 | +## Troubleshooting |
| 254 | + |
| 255 | +### Common Issues |
| 256 | + |
| 257 | +**Nodes not connecting**: Check the bootstrap peer address and ensure the firewall allows UDP traffic on the P2P ports. Verify that node 0 is started first, since it acts as the bootstrap peer.
| 258 | + |
| 259 | +**High or inconsistent latency readings**: Verify system clocks are synchronized across test nodes. Consider NTP setup for distributed testing. |
| 260 | + |
| 261 | +**Out-of-order messages**: This is normal in P2P networks. Monitor the `messages_out_of_order_total` metric to understand network behavior patterns. |
| 262 | + |
| 263 | +**Prometheus not scraping**: Confirm the metric ports are accessible and that the Prometheus configuration includes every node endpoint. With the local script, Prometheus runs in Docker and is configured with all node endpoints automatically; check firewall rules and ensure Docker is running.
| 264 | + |
| 265 | +**Docker permission errors for throttling**: Ensure privileged mode is enabled for network traffic shaping; the container needs the `CAP_NET_ADMIN` capability.
| 266 | + |
| 267 | +**Message size errors**: Ensure `--message-size-bytes` is at least 40 bytes (metadata size). Check the calculation in `converters.rs` if issues persist. |
| 268 | + |
| 269 | +### Debugging |
| 270 | + |
| 271 | +Enable verbose logging for detailed P2P communication: |
| 272 | +```bash |
| 273 | +# For local script (default verbosity is 2) |
| 274 | +python local.py --verbosity 5 |
| 275 | + |
| 276 | +# For direct binary usage |
| 277 | +--verbosity 5 |
| 278 | +``` |
| 279 | + |
| 280 | +Check individual node logs in Kubernetes: |
| 281 | +```bash |
| 282 | +kubectl logs -n network-stress-test-{timestamp} network-stress-test-0 -f |
| 283 | +``` |
| 284 | + |
| 285 | +Monitor live metrics during testing: |
| 286 | +```bash |
| 287 | +# View all metrics from a node |
| 288 | +curl http://localhost:2000/metrics |
| 289 | + |
| 290 | +# Monitor specific metrics |
| 291 | +curl -s http://localhost:2000/metrics | grep messages_received_total |
| 292 | +``` |
| 293 | + |
| 294 | +Use Prometheus queries for analysis: |
| 295 | +```promql |
| 296 | +# Average message latency by sender |
| 297 | +rate(message_delay_seconds_sum[5m]) / rate(message_delay_seconds_count[5m]) |
| 298 | +
|
| 299 | +# Message loss rate |
| 300 | +rate(messages_missing_total[5m]) / rate(messages_sent_total[5m]) |
| 301 | +
|
| 302 | +# Network throughput |
| 303 | +rate(bytes_received_total[5m]) |
| 304 | +``` |