Background
We currently use Loki for log collection and an AWS-managed Timestream InfluxDB instance for telemetry metrics. Both services are publicly exposed, which drives up operating costs and leaves them open to DDoS attacks; InfluxDB is affected most.
Current Architecture
- Loki:
  - Currently sits behind an NGINX reverse proxy for better access control.
  - Logs are exported periodically to a publicly accessible R2 bucket, giving subnet participants redundancy and log access that does not depend on Grafana.
  - This setup is efficient, secure (although it still needs robust authentication), and customizable, because Loki is self-managed.
- InfluxDB:
  - AWS-managed Timestream InfluxDB running on an 8xlarge instance.
  - Very costly and, despite optimizations, still prone to saturation under high traffic (a potential DDoS scenario).
Challenges
- High operational costs associated with InfluxDB.
- No robust protection against metric flooding, which causes downtime or degraded performance for miners and validators and a poor experience for website visitors.
Proposed Solutions
We propose two distinct architectural pathways for optimizing logging and metric ingestion:
Option 1: Reverse Proxy with Traffic Throttling
This approach places the AWS-managed InfluxDB instance behind an NGINX reverse proxy:
- Phase 1:
  - Deploy an NGINX proxy as the traffic gateway for InfluxDB.
  - Configure load balancing and throttling logic that discards lower-priority metrics when capacity is reached.
  - Time estimate: ~1 day of implementation.
- Phase 2:
  - Add basic HTTP authentication at the NGINX layer.
  - Deploy credential management for miners, adding onboarding steps so that ingress is strictly controlled.
  - This gives granular control over metric submissions and allows targeted blocking of malicious actors.
  - Time estimate: 2-3 additional days.
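The Phase 1 idea of discarding lower-priority metrics near capacity can be sketched as a token bucket that holds back part of its budget for high-priority writes. This is only an illustration in Python: in practice the logic would live in the NGINX configuration or in a small service behind it. The class name `PriorityThrottle`, the `reserve` fraction, and the notion of which metrics count as high priority are all assumptions, not part of the current setup.

```python
import time
from typing import Optional


class PriorityThrottle:
    """Token bucket that sheds low-priority metrics before high-priority ones."""

    def __init__(self, rate: float, capacity: float, reserve: float = 0.2):
        self.rate = rate                  # tokens refilled per second (assumed write budget)
        self.capacity = float(capacity)   # maximum burst size
        self.reserve = reserve            # share of capacity kept for high-priority writes
        self.tokens = self.capacity
        self.last = time.monotonic()

    def allow(self, high_priority: bool, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        # Low-priority writes may not dip into the reserved share.
        floor = 0.0 if high_priority else self.reserve * self.capacity
        if self.tokens - 1.0 >= floor:
            self.tokens -= 1.0
            return True
        return False
```

Under sustained load, low-priority submissions start being rejected once the bucket drains to the reserved share, while high-priority submissions are accepted until it is empty. This makes the "metrics data loss when throttling occurs" trade-off explicit and tunable through `reserve`.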
Pros:
- Improved availability by throttling excessive traffic.
- Enhanced security via miner authentication.
Cons:
- Metrics data loss when throttling occurs.
- Additional complexity in miner onboarding and credential management.
- Costs will increase slightly because of the additional load balancer.
Option 2: Decentralized Pull-Based Metrics Collection (Recommended)
This architecture shifts from a centralized push model to a decentralized pull model, significantly reducing operational cost and enhancing reliability:
- Each miner node writes its metrics locally to its own R2 logs bucket.
- Each miner reveals its logs bucket (with read-only access), either through an on-chain commit when the node starts or via an API call to a service that I will introduce. This is similar to how we reveal gradients for gathering.
- A monitoring service collects these metrics periodically (every minute) and consolidates them into a significantly smaller, private InfluxDB instance.
- The metrics are then visualized in Grafana.
- I could also expose the aggregated metrics in an R2 bucket, as we currently do with Loki logs (to be decided later).
This architecture is similar to how Prometheus gathers metrics (pull model) and is often used in federated networks.
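On the miner side, one possible way to write metrics locally before they land in the bucket is to serialize each point as InfluxDB line protocol, so the monitoring service can forward the records without reshaping them. The sketch below assumes that format choice. The function name `to_line_protocol` and the example measurement, tag, and field names are hypothetical.

```python
def _escape(value: str, chars: str) -> str:
    """Backslash-escape characters that line protocol treats as delimiters."""
    for ch in chars:
        value = value.replace(ch, "\\" + ch)
    return value


def to_line_protocol(measurement: str, tags: dict, fields: dict, timestamp_ns: int) -> str:
    """Render one metric point as a single line protocol record."""
    head = _escape(measurement, ", ")
    # Sorting tags by key keeps the output deterministic.
    for key in sorted(tags):
        head += "," + _escape(key, ",= ") + "=" + _escape(str(tags[key]), ",= ")
    parts = []
    for key, value in fields.items():
        if isinstance(value, bool):        # check bool first: bool is a subclass of int
            rendered = "true" if value else "false"
        elif isinstance(value, int):
            rendered = f"{value}i"         # integer fields carry an "i" suffix
        elif isinstance(value, float):
            rendered = repr(value)
        else:                              # everything else is written as a quoted string
            rendered = '"' + str(value).replace("\\", "\\\\").replace('"', '\\"') + '"'
        parts.append(_escape(key, ",= ") + "=" + rendered)
    return f"{head} {','.join(parts)} {timestamp_ns}"


# Hypothetical example point; names are illustrative only.
line = to_line_protocol("miner_step", {"uid": "42"}, {"loss": 2.5, "step": 100}, 1700000000000000000)
```

A miner could append such lines to a file and upload it to its bucket at whatever cadence suits the one-minute collection interval.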
Example Workflow:
- Miner initializes and exposes metrics through a local endpoint or commits bucket details to chain.
- Central monitoring service pulls metrics every minute.
- Metrics aggregated into private, scaled-down InfluxDB for visualization.
    def collect_metrics(miner_list):
        # Pull metrics from each miner and write them to the private InfluxDB.
        for miner in miner_list:
            metrics = request_metrics(miner.endpoint)
            store_metrics_in_influxdb(metrics)
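A slightly fuller sketch of the same loop is shown below, with per-miner failure isolation and a fixed polling interval. The `fetch` and `store` callables stand in for the bucket read and the InfluxDB write. They are placeholders passed in by the caller, not existing functions, and `collect_once` and `run` are hypothetical names.

```python
import logging
import time
from typing import Callable, Dict, Iterable, List, Optional

logger = logging.getLogger("metrics_collector")


def collect_once(
    endpoints: Iterable[str],
    fetch: Callable[[str], List[str]],
    store: Callable[[List[str]], None],
) -> Dict[str, bool]:
    """Pull metrics from every endpoint once and report per-endpoint success."""
    results: Dict[str, bool] = {}
    for endpoint in endpoints:
        try:
            store(fetch(endpoint))
            results[endpoint] = True
        except Exception:
            # Isolate failures so one unreachable miner cannot stall the cycle.
            logger.exception("collection failed for %s", endpoint)
            results[endpoint] = False
    return results


def run(endpoints, fetch, store, interval_s: float = 60.0, max_cycles: Optional[int] = None):
    """Call collect_once every interval_s seconds; max_cycles=None runs forever."""
    cycle = 0
    while max_cycles is None or cycle < max_cycles:
        started = time.monotonic()
        collect_once(endpoints, fetch, store)
        cycle += 1
        # Sleep only for the remainder of the interval to limit drift.
        time.sleep(max(0.0, interval_s - (time.monotonic() - started)))
```

Because the collector only reads from miners, a slow or failing miner costs the collector one logged error per cycle, and an InfluxDB outage never blocks the miners themselves.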
Pros:
- Significant reduction in operational costs (estimated 2-fold reduction).
- Enhanced service availability due to reduced centralized bottleneck.
- Decentralization aligns with blockchain principles, enhancing resilience.
- No performance degradation on miner or validator nodes during database outages.
Cons:
- Collection delay of 1-5 minutes.
Time estimate: ~3 days to implement fully.
Comment
Considering cost and operational efficiency, I think Option 2 addresses our current scalability and cost needs.