Proposal for New Decentralized Logging and Metrics Collection Architecture for Bittensor Subnet #265

@epappas

Background

Currently, we use Loki for log collection and an AWS-managed Timestream for InfluxDB instance for telemetry metrics. Both services are publicly exposed, which leads to considerable operational expense and leaves them vulnerable to DDoS attacks, with InfluxDB affected the most.

Current Architecture

  • Loki:

    • Currently positioned behind an NGINX reverse proxy for better access control.
    • Logs are exported periodically to a publicly accessible R2 bucket, providing redundancy and independent log access for subnet participants without reliance on Grafana.
    • This setup is efficient, secure (although we still need a robust authN), and customizable due to the self-managed nature of Loki.
  • InfluxDB:

    • Managed AWS Timestream for InfluxDB running on an 8xlarge instance.
    • Very costly, and despite optimizations, still vulnerable to saturation under high traffic conditions (potential DDoS scenario).

Challenges

  • High operational costs associated with InfluxDB.
  • Lack of robust protection against metric flooding, which causes downtime or degraded performance for miners and validators and a poor experience for website visitors.

Proposed Solutions

We propose two distinct architectural pathways for optimizing logging and metric ingestion:

Option 1: Reverse Proxy with Traffic Throttling

This approach involves placing AWS-managed InfluxDB behind an NGINX reverse proxy:

  • Phase 1:

    • Implement NGINX proxy to act as a traffic gateway for InfluxDB.
    • Configure load balancing and throttling logic to discard lower-priority metrics when reaching capacity.
    • Time estimate: ~1 day implementation.
  • Phase 2:

    • Implement basic HTTP authentication at the NGINX layer.
    • Deploy credential management for miners, adding onboarding steps to control ingress strictly.
    • Provides granular control over metrics submissions, enabling targeted blocking of malicious actors.
    • Time estimate: 2-3 days additional.
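
The Phase 1 throttling rule (discard lower-priority metrics when reaching capacity) can be illustrated with a small token bucket that keeps a reserved share for high-priority writes. This is only a sketch of the logic in Python, not an NGINX configuration; the class name, the `reserve` fraction, and the notion of a per-request priority flag are assumptions made for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PriorityThrottle:
    """Token bucket that reserves part of its capacity for high-priority metrics."""

    rate: float           # tokens refilled per second (hypothetical capacity)
    burst: float          # maximum number of tokens the bucket can hold
    reserve: float = 0.2  # fraction of the bucket kept for high-priority traffic
    tokens: float = field(init=False)
    last: float = field(init=False)

    def __post_init__(self) -> None:
        self.tokens = self.burst
        self.last = time.monotonic()

    def allow(self, high_priority: bool, now: Optional[float] = None) -> bool:
        """Return True if the write may pass, False if it should be discarded."""
        now = time.monotonic() if now is None else now
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        # Low-priority writes may not dip into the reserved share.
        floor = 0.0 if high_priority else self.burst * self.reserve
        if self.tokens - 1.0 >= floor:
            self.tokens -= 1.0
            return True
        return False
```

With this shape, low-priority metrics start being dropped while the reserved share is still available to high-priority ones, which matches the "metrics data loss when throttling occurs" trade-off listed under Cons.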

Pros:

  • Improved availability by throttling excessive traffic.
  • Enhanced security via miner authentication.

Cons:

  • Metrics data loss when throttling occurs.
  • Additional complexity in miner onboarding and credential management.
  • Cost will increase slightly due to an additional load balancer.

Option 2: Decentralized Pull-Based Metrics Collection (Recommended)

This architecture shifts from a centralized push model to a decentralized pull model, significantly reducing operational cost and enhancing reliability:

  • Each miner node writes metrics locally into its own logs R2 bucket.
  • You reveal your logs bucket (with read-only access), either through an on-chain commit when the node starts or via an API call to a service that I'll introduce. This is similar to how we reveal gradients for gathering.
  • A monitoring service periodically (every minute) collects these metrics, consolidating data into a significantly smaller, private InfluxDB instance.
  • The metrics are then visualized in Grafana.
  • Potentially, I could also export the aggregated metrics to an R2 bucket, as we currently do with Loki logs (to be decided later).

This architecture is similar to how Prometheus gathers metrics (pull model) and is often used in federated networks.

Example Workflow:

  1. Miner initializes and exposes metrics through a local endpoint or commits bucket details to chain.
  2. Central monitoring service pulls metrics every minute.
  3. Metrics aggregated into private, scaled-down InfluxDB for visualization.
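def collect_metrics(miner_list):
    """Pull metrics from every miner; one failing miner must not stop the loop."""
    for miner in miner_list:
        try:
            metrics = request_metrics(miner.endpoint)  # pull from the miner's endpoint or bucket
            store_metrics_in_influxdb(metrics)  # write into the private InfluxDB
        except Exception as exc:
            print(f"skipping {miner.endpoint}: {exc}")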
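
Since the monitoring service polls every minute, it should avoid re-reading objects it has already ingested. A minimal sketch of that incremental pull follows, assuming object keys sort in time order (for example, zero-padded timestamps). The `list_keys` and `read` callables are hypothetical stand-ins for whatever bucket client is used, and the checkpoint is a plain in-memory dict here.

```python
from typing import Callable, Dict, Iterable, List


def pull_new(
    buckets: Iterable[str],
    list_keys: Callable[[str], List[str]],
    read: Callable[[str, str], str],
    checkpoint: Dict[str, str],
) -> List[str]:
    """Return metric lines not yet collected and update the checkpoint in place."""
    lines: List[str] = []
    for bucket in buckets:
        last_seen = checkpoint.get(bucket, "")
        try:
            # Only keys that sort after the last processed key are new.
            new_keys = sorted(k for k in list_keys(bucket) if k > last_seen)
            for key in new_keys:
                lines.extend(read(bucket, key).splitlines())
                checkpoint[bucket] = key  # advance only after a successful read
        except Exception as exc:  # an unreachable miner must not block the others
            print(f"skipping {bucket}: {exc}")
    return lines
```

The returned lines can then be written to the private InfluxDB in one batch per cycle. Persisting the checkpoint across restarts is left out of this sketch.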

Pros:

  • Significant reduction in operational costs (estimated 2-fold reduction).
  • Enhanced service availability due to reduced centralized bottleneck.
  • Decentralization aligns with blockchain principles, enhancing resilience.
  • No performance degradation on miner or validator nodes during database outages.

Cons:

  • Collection delay of 1-5 minutes.

Time estimate: ~3 days to implement fully.

Comment

Considering cost and operational efficiency, I think Option 2 best addresses our current scalability and cost needs.
