
Synthetic MELT Data Generator

A comprehensive synthetic data generator for MELT (Metrics, Events, Logs, Traces) observability data, designed specifically for testing anomaly detection and Root Cause Analysis (RCA) systems.

Overview

This tool generates realistic synthetic observability data that mirrors production patterns while allowing controlled injection of anomalies for testing detection pipelines. Unlike simple random data generators, it creates temporally correlated, schema-compliant data with realistic relationships between different observability signals.

Key Features

  • Realistic Data Patterns: Generates time-series data with natural correlations and distributions
  • Multi-Scale Generation: Supports datasets from 1MB to 10GB+ with efficient memory usage
  • Anomaly Injection: Built-in anomaly patterns (CPU spikes, service outages, error bursts, latency spikes)
  • Schema Compliance: Well-defined schemas compatible with popular observability tools
  • Reproducible Output: Deterministic generation using configurable random seeds
  • Production Ready: CLI interface, Docker support, comprehensive testing

Quick Start

Installation

# Clone the repository
git clone https://github.com/shupandee/MELT_generator
cd MELT_generator

# Install dependencies
pip install -r requirements.txt

# Verify installation
python -c "from synthetic_melt_generator import MeltGenerator; print('Installation successful!')"

Basic Usage

from synthetic_melt_generator import MeltGenerator
from synthetic_melt_generator.core.models import GenerationConfig

# Create generator with custom configuration
config = GenerationConfig(
    seed=42,
    duration_hours=24,
    services=["api", "web", "database"],
    environments=["production", "staging"]
)

generator = MeltGenerator(config)

# Generate 100MB of synthetic data with anomalies
data = generator.generate(
    size="100MB",
    anomalies=["cpu_spike", "error_burst"]
)

# Save to files
generator.save(data, "./output")
print(f"Generated {data.total_records():,} records")

Command Line Interface

# Generate 1GB dataset with multiple anomaly types
python -m synthetic_melt_generator.cli generate \
    --size 1GB \
    --anomalies cpu_spike,service_outage,latency_spike \
    --output ./data \
    --seed 42

# View dataset information
python -m synthetic_melt_generator.cli info --data-dir ./data

# Get help
python -m synthetic_melt_generator.cli --help

Quick Demo

# Run the interactive demo
python simple_demo.py

# Expected output:
# Generated 2,547 records in 0.23s
# 📈 Metrics: 1,018 records
# 📅 Events: 85 records  
# 📝 Logs: 1,273 records
# 🔍 Traces: 171 records

Data Types & Schemas

Metrics

Time-series numerical data with labels and anomaly indicators:

{
  "timestamp": "2024-01-01T12:00:00Z",
  "metric_name": "cpu_usage",
  "value": 45.2,
  "labels": {
    "service": "api",
    "host": "web-01",
    "environment": "production",
    "unit": "percent"
  },
  "anomaly": false
}

Events

Discrete operational events with severity levels:

{
  "timestamp": "2024-01-01T12:00:00Z",
  "event_type": "deployment",
  "severity": "info",
  "source": "ci-cd-pipeline", 
  "message": "Service api v2.1.0 deployed successfully",
  "metadata": {
    "service": "api",
    "version": "v2.1.0"
  }
}

Logs

Structured application logs with trace correlation:

{
  "timestamp": "2024-01-01T12:00:00Z",
  "level": "ERROR",
  "service": "payment-service",
  "message": "Database connection timeout",
  "trace_id": "abc123def456",
  "span_id": "789ghi012",
  "metadata": {
    "host": "web-01",
    "error_code": "DB_TIMEOUT"
  }
}

Traces

Distributed tracing spans with parent-child relationships:

{
  "trace_id": "abc123def456",
  "span_id": "789ghi012", 
  "parent_span_id": "456def789",
  "operation_name": "database_query",
  "start_time": "2024-01-01T12:00:00Z",
  "duration": 150000,
  "service": "user-service",
  "tags": {
    "db.type": "postgresql"
  },
  "status": "ok"
}
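
Since each span carries a parent_span_id, parent-child trees can be rebuilt directly from the generated output. A minimal sketch, assuming traces are saved as traces.jsonl (mirroring the metrics.jsonl convention used in the Validation section below):

import json
from collections import defaultdict

# Hypothetical reader: assumes ./output/traces.jsonl holds one span per
# line, following the trace schema shown above.
spans = {}
children = defaultdict(list)
with open("./output/traces.jsonl") as f:
    for line in f:
        span = json.loads(line)
        spans[span["span_id"]] = span
        if span.get("parent_span_id"):
            children[span["parent_span_id"]].append(span["span_id"])

def print_tree(span_id, depth=0):
    # Walk a span's subtree, indenting child operations under their parent
    span = spans[span_id]
    print("  " * depth + f"{span['operation_name']} ({span['duration']})")
    for child_id in children[span_id]:
        print_tree(child_id, depth + 1)

# Roots are spans whose parent is absent or not present in the file
# (some spans are deliberately dropped; see missing_span_rate below)
roots = [sid for sid, s in spans.items()
         if not s.get("parent_span_id") or s["parent_span_id"] not in spans]
for root in roots:
    print_tree(root)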

See SCHEMAS.md for complete schema specifications.

Architecture

Design Principles

  1. Modular Generation: Separate generators for each data type enable independent scaling
  2. Realistic Patterns: Data follows observed production distributions and correlations
  3. Temporal Consistency: All data types maintain proper time relationships
  4. Anomaly Injection: Pluggable system for injecting various failure patterns
  5. Memory Efficiency: Streaming generation supports large datasets with constant memory usage
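
As an illustration of the memory-efficiency principle (point 5), here is a minimal sketch of the streaming pattern, with a hypothetical helper rather than the actual core/streaming.py API: records are produced lazily and flushed in fixed-size batches, so memory stays bounded no matter how large the dataset grows.

import json

def stream_in_batches(records, path, batch_size=10_000):
    # Write an iterable of records to JSONL in fixed-size batches.
    # Peak memory is proportional to batch_size, not total record count.
    written = 0
    batch = []
    with open(path, "w") as f:
        for record in records:
            batch.append(json.dumps(record))
            if len(batch) >= batch_size:
                f.write("\n".join(batch) + "\n")
                written += len(batch)
                batch.clear()
        if batch:  # flush the final partial batch
            f.write("\n".join(batch) + "\n")
            written += len(batch)
    return written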

Core Components

synthetic_melt_generator/
├── core/
│   ├── generator.py      # Main coordination logic
│   ├── models.py         # Data models and configuration
│   └── streaming.py      # Memory-efficient large dataset generation
├── generators/
│   ├── metrics.py        # Time-series metrics generation
│   ├── events.py         # Operational events generation
│   ├── logs.py           # Structured logs generation
│   └── traces.py         # Distributed tracing generation
├── anomalies/
│   ├── cpu_spike.py      # CPU utilization anomalies
│   ├── service_outage.py # Service availability issues  
│   ├── latency_spike.py  # Response time anomalies
│   └── error_burst.py    # Error rate increases
└── cli.py                # Command-line interface

Anomaly Types

CPU Spike

Realistic CPU utilization spikes affecting multiple metrics:

  • Pattern: Sudden increase to 80-95% utilization
  • Duration: 3-10 minutes (configurable)
  • Correlation: Affects response times and error rates

Service Outage

Complete or partial service unavailability:

  • Pattern: HTTP 5xx errors, connection failures
  • Duration: 5-30 minutes (configurable)
  • Correlation: Cascade failures in dependent services

Latency Spike

Response time degradation patterns:

  • Pattern: 2-10x normal response times
  • Duration: 2-15 minutes (configurable)
  • Correlation: Often precedes error bursts

Error Burst

Increased application error rates:

  • Pattern: Error rate jumps from ~1% to 10-50%
  • Duration: 1-5 minutes (configurable)
  • Correlation: Higher log error levels, failed traces
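
All four injectors share the same shape: pick a time window, transform the records inside it, and set the record's anomaly flag (visible in the metrics schema above) so the data stays labeled. A minimal sketch of a CPU-spike-style injector, with hypothetical names:

from datetime import datetime, timedelta
import random

def inject_cpu_spike(metrics, start, duration_minutes=5):
    # Raise cpu_usage values inside [start, start + duration) to the
    # documented 80-95% band and mark those records as anomalous.
    # start must be timezone-aware (UTC) to match the Z-suffixed timestamps.
    end = start + timedelta(minutes=duration_minutes)
    for m in metrics:
        ts = datetime.fromisoformat(m["timestamp"].replace("Z", "+00:00"))
        if m["metric_name"] == "cpu_usage" and start <= ts < end:
            m["value"] = random.uniform(80.0, 95.0)
            m["anomaly"] = True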

Performance Characteristics

Dataset Size   Generation Time   Memory Usage   Rate
10MB           1.2s              <50MB          ~8,500 rec/sec
100MB          11.8s             <100MB         ~9,200 rec/sec
1GB            118s              <200MB         ~9,800 rec/sec
10GB           1,180s            <500MB         ~10,100 rec/sec

Benchmarked on 8-core CPU, 16GB RAM

Configuration

Generation Config

config = GenerationConfig(
    seed=42,                    # Reproducible randomness
    duration_hours=24,          # Time span to simulate
    services=["api", "web"],    # Service names
    hosts=["web-01", "db-01"],  # Host names
    environments=["prod"],      # Environment names
    
    # Data frequencies
    metrics_frequency_seconds=30,
    logs_frequency_seconds=1, 
    traces_frequency_seconds=10,
    events_frequency_seconds=300,
    
    # Realism parameters
    error_rate=0.05,           # Base error rate (5%)
    debug_log_ratio=0.3,       # Debug logs (30%)
    missing_span_rate=0.02     # Missing trace spans (2%)
)
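
For intuition about volume: with metrics_frequency_seconds=30, a 24-hour window yields 24 × 3600 / 30 = 2,880 samples per metric series, before multiplying by the number of services, hosts, and distinct metric names.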

Anomaly Config

# AnomalyConfig is assumed to live alongside GenerationConfig in core.models
from synthetic_melt_generator.core.models import AnomalyConfig

anomaly_config = AnomalyConfig(
    cpu_spike_probability=0.05,      # 5% chance per time window
    cpu_spike_duration_minutes=5,    # 5-minute spikes
    cpu_spike_intensity=3.0,         # 3x normal usage
    
    service_outage_probability=0.01, # 1% chance per time window  
    service_outage_duration_minutes=10,
    
    latency_spike_multiplier=5.0,    # 5x normal latency
    error_burst_rate=0.5             # 50% error rate during burst
)

Testing

Run Test Suite

# Basic functionality tests
python test_basic.py

# Expected output:
# 🧪 Testing basic generation...
# ✅ Generated 1,847 records
# 🧪 Testing anomaly injection...  
# ✅ Found 12 anomalous metrics
# 🧪 Testing data schemas...
# ✅ All schemas valid
# 📊 Test Results: 3 passed, 0 failed

Validation

# Schema validation
python -m synthetic_melt_generator.cli generate --size 10MB --output ./test
python -c "
import json
with open('./test/metrics.jsonl') as f:
    metric = json.loads(f.readline())
    print('Sample metric:', metric)
    assert 'timestamp' in metric
    assert 'value' in metric  
    print('✅ Schema validation passed')
"

Design Trade-offs

Realism vs Performance

Choice: Prioritized realistic data patterns over raw generation speed

  • Benefit: Data closely mimics production observability patterns
  • Cost: ~20% slower than pure random generation
  • Rationale: Realistic testing data is more valuable for ML model training

Memory vs Disk I/O

Choice: Streaming generation with batch writes

  • Benefit: Constant memory usage regardless of dataset size
  • Cost: More complex implementation, higher disk I/O
  • Rationale: Enables generation of multi-GB datasets on resource-constrained systems

Flexibility vs Simplicity

Choice: Rich configuration options with sensible defaults

  • Benefit: Adapts to diverse testing scenarios
  • Cost: More complex API surface
  • Rationale: Different anomaly detection systems need different data characteristics

Determinism vs Variety

Choice: Configurable random seeds with pseudo-random patterns

  • Benefit: Reproducible test datasets while maintaining statistical properties
  • Cost: Patterns may become predictable with analysis
  • Rationale: Reproducibility is crucial for ML pipeline testing and debugging
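
Reproducibility can be spot-checked with the documented API. A sketch, assuming timestamps derive from the configured time window rather than wall-clock time, and using the metrics.jsonl filename from the Validation section:

import filecmp

from synthetic_melt_generator import MeltGenerator
from synthetic_melt_generator.core.models import GenerationConfig

# Run the same seeded configuration twice into separate directories
for out_dir in ("./run_a", "./run_b"):
    generator = MeltGenerator(GenerationConfig(seed=42, duration_hours=1))
    data = generator.generate(size="10MB")
    generator.save(data, out_dir)

# With a fixed seed, both runs should produce byte-identical files
match, mismatch, errors = filecmp.cmpfiles(
    "./run_a", "./run_b", ["metrics.jsonl"], shallow=False)
assert not mismatch and not errors, "generation is not deterministic"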

Use Cases

Anomaly Detection Testing

Generate labeled datasets with known anomalies for model validation:

# Generate training data with 5% anomaly rate
training_data = generator.generate(
    size="1GB",
    anomalies=["cpu_spike", "error_burst"]  
)

# Generate clean validation data
validation_data = generator.generate(
    size="100MB", 
    anomalies=None
)
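
Because each generated metric carries an anomaly flag (see the Metrics schema), ground-truth labels can be read straight from the output for evaluation. A sketch, assuming metrics land in metrics.jsonl:

import json

labels, values = [], []
with open("./output/metrics.jsonl") as f:
    for line in f:
        metric = json.loads(line)
        labels.append(metric["anomaly"])   # ground-truth label
        values.append(metric["value"])

anomaly_rate = sum(labels) / len(labels)
print(f"{len(labels):,} points, {anomaly_rate:.1%} labeled anomalous")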

Load Testing Observability Systems

Create realistic data volumes for performance testing:

# Generate high-volume data stream
large_dataset = generator.generate(size="10GB")
# Test ingestion rates, storage efficiency, query performance

RCA System Development

Generate correlated failure patterns for root cause analysis:

# Create incident scenarios with cascading failures
incident_data = generator.generate(
    size="500MB",
    anomalies=["service_outage"]  # Triggers dependent failures
)

Contributing

  1. Fork the repository
  2. Create feature branch: git checkout -b feature/new-anomaly-type
  3. Make changes with tests: python test_basic.py
  4. Commit changes: git commit -m 'Add new anomaly pattern'
  5. Push branch: git push origin feature/new-anomaly-type
  6. Create Pull Request

Development Setup

# Install development dependencies
pip install -r requirements-dev.txt

# Run linting
flake8 synthetic_melt_generator/
black synthetic_melt_generator/

# Run type checking  
mypy synthetic_melt_generator/

License

MIT License - see LICENSE file for details.

Acknowledgments

Built as part of the BugRaid AI Assessment, demonstrating synthetic data generation techniques for observability system testing.


Note: This is a proof-of-concept implementation focused on demonstrating realistic synthetic data generation patterns. For production use, consider additional features like data encryption, advanced streaming protocols, and integration with specific observability platforms.
