Commit 30f4a37

apollo_network: broadcast network stress test draft
1 parent 61e6485

File tree

17 files changed: +1935 −123 lines

Cargo.lock

Lines changed: 24 additions & 0 deletions

Cargo.toml

Lines changed: 1 addition & 0 deletions
```diff
@@ -303,6 +303,7 @@ statistical = "1.0.0"
 strum = "0.25.0"
 strum_macros = "0.25.2"
 syn = "2.0.39"
+sysinfo = "0.32.1"
 tar = "0.4.38"
 tempfile = "3.7.0"
 test-case = "3.2.1"
```

crates/apollo_network/Cargo.toml

Lines changed: 1 addition & 0 deletions
```diff
@@ -37,6 +37,7 @@ metrics-exporter-prometheus.workspace = true
 replace_with.workspace = true
 serde = { workspace = true, features = ["derive"] }
 starknet_api.workspace = true
+sysinfo.workspace = true
 thiserror.workspace = true
 tokio = { workspace = true, features = ["full", "sync"] }
 tokio-retry.workspace = true
```
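The diff itself doesn't show where the new `sysinfo` dependency is consumed; as a minimal, hedged sketch of the `sysinfo` 0.32 API (sampling host CPU and memory is a plausible use in a stress-test binary):

```rust
use sysinfo::System;

fn main() {
    // Sample host resource usage; sysinfo 0.32 reports memory in bytes.
    let mut sys = System::new_all();
    sys.refresh_all();
    println!("used memory: {} / {} bytes", sys.used_memory(), sys.total_memory());
    println!("logical cpus: {}", sys.cpus().len());
}
```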
Lines changed: 288 additions & 25 deletions
Removed lines:

```diff
@@ -1,41 +1,304 @@
-# Network Stress Test
-
-## Setup and Run Stress Test
-
-1. **Create Remote Engines**
-
-Create 5 gcloud VM instances. Make sure to have the necessary RAM and disk space. Each instance should be named in the following pattern:
-
-```
-<instance-name>-0, ... ,<instance-name>-4
-```
-
-2. **Set Bootstrap Node**
-
-Find the internal IP of your bootstrap node in the VM instances chart on google cloud console. Paste it into the test_config.json file into the bootstrap_peer_multaddr value instead of its placeholder.
-
-3. **Install Rust and clone repository**
-
-For all 5 instances run:
-
-```
-gcloud compute ssh <instance-name>-0 --project <project-name> -- 'cd <path-to-repo> && sudo apt install -y git unzip clang && curl https://sh.rustup.rs -sSf | sh -s -- -y && source "$HOME/.cargo/env" && git clone https://github.com/starkware-libs/sequencer.git; cd sequencer && sudo scripts/dependencies.sh cargo build --release -p apollo_network --bin network_stress_test'
-```
-
-4. **Run test**
-
-```
-PROJECT_ID=<project-name> BASE_INSTANCE_NAME=<instance-name> ZONE=<zone> ./run_broadcast_stress_test.sh
-```
-
-5. **Results**
-
-Results are retrieved from VM instances and saved to /output.csv. You can change the default path by adjusting the config file.
-
-## Pull repo updates to virtual machines
-
-1. **Run**
-
-```
-PROJECT_ID=<project-name> BASE_INSTANCE_NAME=<instance-name> ZONE=<zone> ./pull_stress_test.sh
-```
```

Added content:

# Broadcast Network Stress Test Node

A comprehensive network stress testing tool for the Apollo network that tests P2P communication, measures performance metrics, and validates network behavior under various load patterns and conditions.

## Overview

The broadcast network stress test node is designed to stress test the P2P communication layer of the Apollo network. It creates a network of nodes with configurable broadcasting patterns, measuring latency, throughput, message ordering, and overall network performance. The tool supports both local testing (using the provided Python scripts) and distributed deployment via Kubernetes with optional network throttling.

## Features

- **Multiple Broadcasting Modes**: Supports different message broadcasting patterns (all nodes, single broadcaster, round-robin)
- **Advanced Performance Metrics**: Measures message latency, throughput, delivery rates, ordering, and duplicate detection
- **Message Ordering Analysis**: Tracks out-of-order messages, missing messages, and duplicates
- **Prometheus Integration**: Exports detailed metrics with proper labels for monitoring and analysis
- **Network Throttling**: Supports bandwidth and latency gating for realistic network conditions
- **Configurable Parameters**: Customizable message sizes, send intervals, buffer sizes, and test duration
- **Multi-Node Support**: Can run multiple coordinated nodes with different broadcasting patterns
- **Kubernetes Deployment**: Includes YAML templates for cluster deployment with traffic shaping
- **Deterministic Peer IDs**: Generates consistent peer identities for reproducible tests

## Building

Build the stress test node binary:

```bash
cargo build --release --bin broadcast_network_stress_test_node
```
## Command Line Arguments

| Argument | Description | Default | Environment Variable |
|----------|-------------|---------|----------------------|
| `--id` | Node ID for identification and metrics | Required | `ID` |
| `--num-nodes` | Total number of nodes in the network | 3 | `NUM_NODES` |
| `--metric-port` | Prometheus metrics server port | 2000 | `METRIC_PORT` |
| `--p2p-port` | P2P network port | 10000 | `P2P_PORT` |
| `--bootstrap` | Bootstrap peer addresses (comma-separated) | None | `BOOTSTRAP` |
| `--verbosity` | Log verbosity (0-5: None, ERROR, WARN, INFO, DEBUG, TRACE) | 2 | `VERBOSITY` |
| `--buffer-size` | Broadcast topic buffer size | 10000 | `BUFFER_SIZE` |
| `--message-size-bytes` | Message payload size in bytes | 1024 | `MESSAGE_SIZE_BYTES` |
| `--heartbeat-millis` | Interval between messages (milliseconds) | 1 | `HEARTBEAT_MILLIS` |
| `--mode` | Broadcasting mode: `all`, `one`, or `rr` | `all` | `MODE` |
| `--broadcaster` | Node ID for broadcasting (`one` mode) | 1 | `BROADCASTER` |
| `--round-duration-seconds` | Duration per node in round-robin mode (seconds) | 3 | `ROUND_DURATION_SECONDS` |
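The expected `--bootstrap` address format is not spelled out in this README. As a hedged illustration, libp2p peers are conventionally dialed by multiaddrs ending in the peer ID, so a non-bootstrap node might be started like this (the IP, transport, and peer ID below are placeholders; the actual transport depends on the node's `NetworkConfig`):

```bash
# Hypothetical invocation: node 1 dialing node 0 as its bootstrap peer.
# "12D3KooW..." stands in for node 0's deterministic peer ID.
./target/release/broadcast_network_stress_test_node \
  --id 1 \
  --p2p-port 10001 \
  --bootstrap /ip4/10.0.0.1/udp/10000/quic-v1/p2p/12D3KooW...
```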
## Broadcasting Modes

### All Broadcast (`all`)
All nodes continuously broadcast messages simultaneously. Best for testing network capacity and concurrent message handling.

### Single Broadcaster (`one`)
Only the node specified by `--broadcaster` sends messages, while others act as receivers. Ideal for testing message propagation and network topology.

### Round Robin (`rr`)
Nodes take turns broadcasting in sequential order based on their ID. Each node broadcasts for `--round-duration-seconds` before passing to the next. Useful for testing network behavior under changing load patterns.
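The round-robin schedule implied above reduces to simple clock arithmetic. A minimal sketch (the actual coordination logic lives in `main.rs` and may differ):

```rust
/// Illustrative only: which node should broadcast, given seconds elapsed
/// since a start time all nodes agree on.
fn active_broadcaster(elapsed_secs: u64, round_duration_seconds: u64, num_nodes: u64) -> u64 {
    (elapsed_secs / round_duration_seconds) % num_nodes
}

fn main() {
    // With 3 nodes and the default 3-second rounds: node 0 broadcasts during
    // seconds 0-2, node 1 during 3-5, node 2 during 6-8, then the cycle repeats.
    assert_eq!(active_broadcaster(4, 3, 3), 1);
    assert_eq!(active_broadcaster(9, 3, 3), 0);
}
```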
## Running Locally
### Recommended: Multi-Node Network using Local Script

The best way to run locally is using the local script. First, navigate to the run directory:

```bash
cd crates/apollo_network/src/bin/broadcast_network_stress_test_node/run
python local.py --num-nodes 3 --verbosity 3 --mode rr
```

This will:
- Compile the binary if needed
- Start 3 nodes with sequential ports (10000, 10001, 10002)
- Automatically configure bootstrap peers for all nodes
- Launch Prometheus in Docker for metrics collection
- Provide a web interface at http://localhost:9090
### Manual Single Node (Advanced)

For direct binary testing (not recommended for most use cases):

```bash
./target/release/broadcast_network_stress_test_node \
  --id 0 \
  --metric-port 2000 \
  --p2p-port 10000 \
  --verbosity 3 \
  --mode all
```
### Advanced Local Testing

All commands should be run from the run directory:

```bash
cd crates/apollo_network/src/bin/broadcast_network_stress_test_node/run

# Test round-robin mode with custom timing
python local.py --num-nodes 5 --mode rr --round-duration-seconds 10 --heartbeat-millis 100

# Test single broadcaster mode
python local.py --num-nodes 3 --mode one --broadcaster 0 --message-size-bytes 4096
```
## Kubernetes Deployment

### Prerequisites

- Kubernetes cluster access
- Docker registry access
- kubectl configured

### Deploy to Cluster

```bash
cd crates/apollo_network/src/bin/broadcast_network_stress_test_node/run
python cluster_start.py --num-nodes 5 --latency 50 --throughput 1000 --mode rr
```
This will:
- Build and push a Docker image
- Create a Kubernetes StatefulSet with 5 nodes
- Apply network throttling (50 ms latency, 1000 KB/s throughput)
- Deploy to a timestamped namespace

### Access Prometheus

```bash
python cluster_port_forward_prometheus.py
```

Then visit http://localhost:9090 for metrics visualization.
### Cleanup

```bash
python cluster_stop.py
```
136+
137+
## Network Throttling
138+
139+
The Docker deployment supports network traffic shaping to simulate realistic network conditions:
140+
141+
- **Latency Gating**: Add artificial delay to packets (via `LATENCY` environment variable in ms)
142+
- **Throughput Limiting**: Cap bandwidth to test under constrained conditions (via `THROUGHPUT` environment variable in KB/s)
143+
144+
The entrypoint script uses Linux traffic control (`tc`) with HTB (Hierarchical Token Bucket) for bandwidth limiting and NetEm for latency simulation.
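For orientation, here is a minimal sketch of this kind of `tc` setup; the real `entrypoint.sh` may use different device names, handles, and rate units:

```bash
#!/usr/bin/env bash
# Illustrative sketch only; requires CAP_NET_ADMIN (privileged container).
DEV=eth0
THROUGHPUT_KBPS="${THROUGHPUT:-1000}"  # KB/s, mirroring the THROUGHPUT env var
LATENCY_MS="${LATENCY:-50}"            # ms, mirroring the LATENCY env var

# HTB root qdisc with a single class that caps bandwidth
# (in tc, "kbps" means kilobytes per second).
tc qdisc add dev "$DEV" root handle 1: htb default 10
tc class add dev "$DEV" parent 1: classid 1:10 htb rate "${THROUGHPUT_KBPS}kbps"

# NetEm attached under the HTB class adds a constant delay to every packet.
tc qdisc add dev "$DEV" parent 1:10 handle 10: netem delay "${LATENCY_MS}ms"
```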
## Metrics

The tool exports comprehensive Prometheus metrics with proper labels:

### Message Flow Metrics
- `messages_sent_total`: Total messages sent by this node
- `messages_received_total`: Total messages received (with `sender_id` label)
- `bytes_received_total`: Total bytes received across all messages

### Performance Metrics
- `message_delay_seconds`: End-to-end message latency histogram (with `sender_id` label)

### Message Ordering Metrics
- `messages_out_of_order_total`: Messages received out of sequence (with `sender_id` label)
- `messages_missing_total`: Messages that appear to be missing (with `sender_id` label)
- `messages_duplicate_total`: Duplicate messages detected (with `sender_id` label)
- `messages_missing_retrieved_total`: Previously missing messages that arrived late (with `sender_id` label)

All metrics include appropriate labels for per-sender analysis, enabling detailed network behavior study.
## Configuration

### Message Structure

Each stress test message contains:
- **Sender ID**: Node identifier (8 bytes)
- **Message Index**: Sequential message number from sender (8 bytes)
- **Timestamp**: Send time as nanoseconds since UNIX epoch (16 bytes)
- **Payload Length**: Size of variable payload (8 bytes)
- **Payload**: Configurable data (remaining bytes)
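As a hedged sketch of this wire layout (field names and byte order here are illustrative; the authoritative encoding is in `converters.rs`):

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical mirror of the layout described above.
struct StressTestMessage {
    sender_id: u64,        // 8 bytes
    message_index: u64,    // 8 bytes
    timestamp_nanos: u128, // 16 bytes, nanoseconds since UNIX epoch
    payload: Vec<u8>,      // variable, preceded by an 8-byte length
}

impl StressTestMessage {
    fn to_bytes(&self) -> Vec<u8> {
        // 40 bytes of metadata, matching the minimum size noted in Troubleshooting.
        let mut buf = Vec::with_capacity(40 + self.payload.len());
        buf.extend_from_slice(&self.sender_id.to_le_bytes());
        buf.extend_from_slice(&self.message_index.to_le_bytes());
        buf.extend_from_slice(&self.timestamp_nanos.to_le_bytes());
        buf.extend_from_slice(&(self.payload.len() as u64).to_le_bytes());
        buf.extend_from_slice(&self.payload);
        buf
    }
}

fn main() {
    let msg = StressTestMessage {
        sender_id: 0,
        message_index: 1,
        timestamp_nanos: SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .unwrap()
            .as_nanos(),
        payload: vec![0u8; 984], // 40 + 984 = a 1024-byte message
    };
    assert_eq!(msg.to_bytes().len(), 1024);
}
```

A receiver can derive `message_delay_seconds` by subtracting the embedded timestamp from its own clock, which is why the Troubleshooting section stresses clock synchronization.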
### Network Topology

- All nodes join the same gossipsub topic: `stress_test_topic`
- Node 0 typically acts as the bootstrap peer for network discovery
- Deterministic peer IDs based on node ID ensure consistent network formation
- Secret keys are generated deterministically from node ID for reproducibility
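One hedged way to get such determinism with libp2p is to use the node ID as the ed25519 key seed (illustrative; the tool's actual derivation may differ):

```rust
use libp2p::identity::Keypair;
use libp2p::PeerId;

/// Illustrative: derive a stable keypair by embedding the node ID in a
/// 32-byte ed25519 seed.
fn deterministic_keypair(node_id: u64) -> Keypair {
    let mut seed = [0u8; 32];
    seed[..8].copy_from_slice(&node_id.to_le_bytes());
    Keypair::ed25519_from_bytes(seed).expect("a 32-byte seed is always valid")
}

fn main() {
    // The same node ID always yields the same PeerId, so bootstrap
    // multiaddrs can be computed before any node starts.
    let keypair = deterministic_keypair(0);
    println!("node 0 peer id: {}", PeerId::from(keypair.public()));
}
```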
## Example Use Cases

### Latency Testing
```bash
# Test with 100ms network latency
python cluster_start.py --num-nodes 3 --latency 100 --message-size-bytes 512 --mode all
```

### Throughput Testing
```bash
# Test with 500 KB/s bandwidth limit
python cluster_start.py --num-nodes 5 --throughput 500 --heartbeat-millis 10 --mode rr
```

### Large Message Testing
```bash
# Test with 64KB messages in single broadcaster mode (run from the run directory)
cd crates/apollo_network/src/bin/broadcast_network_stress_test_node/run
python local.py --num-nodes 3 --message-size-bytes 65536 --heartbeat-millis 100 --mode one
```

### Network Resilience Testing
```bash
# Test round-robin with constrained network
python cluster_start.py --num-nodes 4 --latency 200 --throughput 100 --mode rr --round-duration-seconds 30
```
## Development

### File Structure

- `main.rs`: Core stress test logic, broadcasting modes, and coordination
- `converters.rs`: Message serialization/deserialization with ordering support
- `converters_test.rs`: Unit tests for message conversion
- `utils.rs`: Configuration utilities and helper functions
- `run/`: Deployment scripts and configurations
  - `local.py`: Local multi-node testing with Prometheus
  - `cluster_start.py`: Kubernetes deployment with throttling
  - `cluster_stop.py`: Cleanup of deployed resources
  - `cluster_port_forward_prometheus.py`: Prometheus access helper
  - `yaml_maker.py`: Kubernetes YAML generation
  - `args.py`: Shared argument parsing for Python scripts
  - `utils.py`: Common utility functions
  - `Dockerfile`: Container image with traffic-shaping capabilities
  - `entrypoint.sh`: Container startup script with network throttling
  - Various Kubernetes YAML templates
### Adding New Metrics

1. Import the metrics crate macros: `use metrics::{counter, histogram, gauge};`
2. Add metric recording in message handlers or broadcasting logic (see the sketch below)
3. Use appropriate labels for detailed analysis
4. Update the Prometheus configuration in the deployment scripts if needed
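A hedged sketch of step 2, assuming the macro API of `metrics` 0.22+ (older releases pass the value as a macro argument instead):

```rust
use metrics::{counter, histogram};

// Illustrative only: record a hypothetical per-sender counter and latency
// sample, mirroring the label scheme used by the existing metrics.
fn record_received(sender_id: u64, delay_secs: f64) {
    counter!("messages_received_total", "sender_id" => sender_id.to_string()).increment(1);
    histogram!("message_delay_seconds", "sender_id" => sender_id.to_string()).record(delay_secs);
}
```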
### Adding New Broadcasting Modes

1. Extend the `Mode` enum in `main.rs` (a hypothetical shape is sketched below)
2. Update the mode-specific logic in `send_stress_test_messages()`
3. Add corresponding argument parsing in `args.py`
4. Update documentation and examples
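For step 1, a hypothetical shape of the enum (the real definition in `main.rs` may differ in names and payloads):

```rust
/// Illustrative only. CLI parsing would map the strings "all", "one",
/// and "rr" onto these variants.
#[derive(Clone, Copy, Debug)]
enum Mode {
    AllBroadcast,
    OneBroadcast { broadcaster: u64 },
    RoundRobin { round_duration_seconds: u64 },
    // A new mode would be added as another variant here.
}
```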
### Network Configuration

Modify `NetworkConfig` parameters in `main.rs` for different P2P behaviors:
- Connection limits and timeouts
- Heartbeat intervals
- Gossipsub parameters (mesh size, fanout, etc.)
- Discovery mechanisms and protocols
## Troubleshooting

### Common Issues

**Nodes not connecting**: Check the bootstrap peer address and ensure the firewall allows UDP traffic on the P2P ports. Verify that node 0 is started first as the bootstrap peer.

**High or inconsistent latency readings**: Verify system clocks are synchronized across test nodes. Consider an NTP setup for distributed testing.

**Out-of-order messages**: This is normal in P2P networks. Monitor the `messages_out_of_order_total` metric to understand network behavior patterns.

**Prometheus not scraping**: Confirm metric ports are accessible and the Prometheus configuration includes all node endpoints. When using the local script, Prometheus runs in Docker and is configured with all node endpoints automatically. Check firewall rules and ensure Docker is running properly.

**Docker permission errors for throttling**: Ensure privileged mode is enabled for network traffic shaping; the container needs the CAP_NET_ADMIN capability.

**Message size errors**: Ensure `--message-size-bytes` is at least 40 bytes (the metadata size). Check the size calculation in `converters.rs` if issues persist.
### Debugging

Enable verbose logging for detailed P2P communication:
```bash
# For the local script (default verbosity is 2)
python local.py --verbosity 5

# For direct binary usage
--verbosity 5
```

Check individual node logs in Kubernetes:
```bash
kubectl logs -n network-stress-test-{timestamp} network-stress-test-0 -f
```

Monitor live metrics during testing:
```bash
# View all metrics from a node
curl http://localhost:2000/metrics

# Monitor specific metrics
curl -s http://localhost:2000/metrics | grep messages_received_total
```

Use Prometheus queries for analysis:
```promql
# Average message latency by sender
rate(message_delay_seconds_sum[5m]) / rate(message_delay_seconds_count[5m])

# Message loss rate
rate(messages_missing_total[5m]) / rate(messages_sent_total[5m])

# Network throughput
rate(bytes_received_total[5m])
```
