| title | Rating API Operations |
|---|---|
| service | rating-api |
| status | active |
| last_reviewed | 2025-12-02 |
| type | operations |
- Containerized via Docker
- Exposes port `8013`
- Requires PostgreSQL
- Optional: NATS (JetStream) for event publishing (`rating.updated`)
- Kafka consumer for game-ended events
- `DATABASE_URL`: Postgres DSN (asyncpg)
- `REQUIRE_AUTH`: enable bearer auth for all endpoints
- `INTERNAL_BEARER_TOKEN`: shared token for internal calls
- `GLICKO_*`: engine defaults
- `OUTBOX_ENABLED`: toggle event outbox publisher (default: `true`)
- `OUTBOX_NATS_URL`: NATS server URL (default: `nats://nats:4222`)
- `OUTBOX_PUBLISH_INTERVAL_SEC`: polling interval for publisher
- `KAFKA_BOOTSTRAP_SERVERS`: Kafka broker addresses
- `KAFKA_GAME_EVENTS_TOPIC`: topic name for game events
- `KAFKA_CONSUMER_ENABLED`: enable/disable Kafka consumer
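As an illustration of how these variables might be read at startup, here is a minimal sketch using only the standard library. The `Settings` class, the `load_settings` helper, and the defaults for `REQUIRE_AUTH` and `KAFKA_CONSUMER_ENABLED` are assumptions made for the example; only the `OUTBOX_ENABLED` and `OUTBOX_NATS_URL` defaults come from the list above.

```python
import os
from dataclasses import dataclass


def _env_bool(name, default, env):
    """Interpret common truthy strings; fall back to the default when unset."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    # Hypothetical container for a subset of the documented variables.
    database_url: str
    require_auth: bool
    outbox_enabled: bool
    outbox_nats_url: str
    kafka_consumer_enabled: bool


def load_settings(env=None):
    """Build Settings from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env
    return Settings(
        database_url=env["DATABASE_URL"],  # required: KeyError if missing
        require_auth=_env_bool("REQUIRE_AUTH", False, env),  # default is an assumption
        outbox_enabled=_env_bool("OUTBOX_ENABLED", True, env),  # documented default
        outbox_nats_url=env.get("OUTBOX_NATS_URL", "nats://nats:4222"),  # documented default
        kafka_consumer_enabled=_env_bool("KAFKA_CONSUMER_ENABLED", True, env),  # default is an assumption
    )
```

Passing a plain dictionary instead of `os.environ` keeps the loader easy to exercise in tests.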
- `GET /health`: service liveness
Prometheus Metrics: GET /metrics → Exports Prometheus-formatted metrics
- `http_requests_total`: total HTTP requests by method, endpoint, and status
- `http_request_latency_seconds`: request latency histogram (p50, p95, p99)
- `http_errors_total`: HTTP error count by status code
- `rating_updates_total` (counter): total rating updates by `pool_id`
- `rating_update_latency_seconds` (histogram): rating update processing latency
- `rating_event_processing_lag_seconds` (histogram): lag between event timestamp and processing time
- `db_query_duration_seconds`: database query duration by operation type
- `kafka_events_consumed_total`: Kafka events consumed by event type and status
- `kafka_event_processing_duration_seconds`: event processing duration
- `kafka_event_processing_errors_total`: event processing errors
- `kafka_consumer_lag`: Kafka consumer lag in messages
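Percentiles such as p95 are not stored in a Prometheus histogram; they are estimated from cumulative bucket counts. The sketch below shows one way to do that estimate by linear interpolation inside the matching bucket, similar in spirit to what PromQL's `histogram_quantile` does. The function name and the bucket boundaries in the usage note are illustrative assumptions, not the service's actual configuration.

```python
def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound,
    with float("inf") as the last upper bound.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # No upper edge to interpolate towards: report the last finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

For example, with hypothetical buckets `[(0.05, 50), (0.1, 90), (0.2, 98), (float("inf"), 100)]`, the 95th observation falls in the 0.1 to 0.2 second bucket, so the estimate lands between those two bounds. The accuracy of any such estimate depends on how finely the buckets are spaced around the SLO thresholds.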
OpenTelemetry tracing is enabled and configured. Traces include:
- HTTP request/response spans
- Database query spans
- Kafka event processing spans
- Rating calculation spans
Trace IDs are included in structured logs for correlation.
All logs are structured (JSON format) and include:
- Correlation IDs (from request headers)
- Trace IDs (from OpenTelemetry)
- Timestamp, log level, service name
- Contextual information (game_id, user_id, pool_id, etc.)
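A structured log line with these fields could be produced with a small formatter on top of the standard `logging` module. This is a sketch under assumptions: the `JsonFormatter` class and the exact JSON key names are hypothetical, and the real service may use a dedicated logging library instead.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render log records as one JSON object per line (illustrative only)."""

    # Context attributes copied from the record when present (assumed names).
    CONTEXT_FIELDS = ("correlation_id", "trace_id", "game_id", "user_id", "pool_id")

    def __init__(self, service_name):
        super().__init__()
        self.service_name = service_name

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service_name,
            "message": record.getMessage(),
        }
        for field in self.CONTEXT_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                payload[field] = value
        return json.dumps(payload)
```

Context values can then be attached per call through the standard `extra` argument, for example `logger.info("rating updated", extra={"game_id": "g1", "trace_id": "abc"})`.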
- Target: 99.9% uptime (approximately 43 minutes downtime per month)
- Measurement: Service liveness endpoint (`/health`) responding with 200 OK
- Alerting: Alert if availability drops below 99.9% over a 30-day window
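The 43-minute figure follows directly from the target and the window length. The arithmetic can be sketched as below; the function names are hypothetical helpers, not part of the service.

```python
def downtime_budget_minutes(slo, window_days=30):
    """Allowed downtime in minutes for an availability target over a window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining_minutes(slo, downtime_so_far_minutes, window_days=30):
    """Minutes of downtime still available before the target is missed."""
    return downtime_budget_minutes(slo, window_days) - downtime_so_far_minutes
```

With `slo=0.999` and a 30-day window the budget is 0.001 × 43,200 minutes, which is about 43.2 minutes. A negative result from `budget_remaining_minutes` means the window has already breached the target.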
- Rating Update Processing: p95 < 100ms, p99 < 200ms
  - Measures time from event receipt to rating calculation and persistence
- HTTP API Requests: p95 < 200ms, p99 < 500ms
  - Measures end-to-end HTTP request latency
- Event Processing Lag: p95 < 1 minute
  - Measures time between event timestamp and processing time
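To make the thresholds concrete, the following sketch checks a batch of raw latency samples against them using a nearest-rank percentile. The `SLO_SECONDS` table mirrors the targets listed above; the dictionary keys and function names are assumptions for the example, and in production these checks would normally run against the exported histograms rather than raw samples.

```python
import math

# Latency targets from the list above, in seconds, keyed by percentile.
SLO_SECONDS = {
    "rating_update": {95: 0.100, 99: 0.200},
    "http_request": {95: 0.200, 99: 0.500},
    "event_processing_lag": {95: 60.0},
}


def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list of samples (p in 1..100)."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def slo_violations(name, samples):
    """Return (percentile, observed, limit) for every target that is not met."""
    violations = []
    for p, limit in SLO_SECONDS[name].items():
        observed = percentile(samples, p)
        if observed >= limit:  # targets are strict upper bounds
            violations.append((p, observed, limit))
    return violations
```

An empty list from `slo_violations` means every listed percentile for that measurement is under its limit.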
- Target: < 0.1% error rate (5xx errors / total requests)
- Measurement: Count of 5xx responses divided by total requests
- Alerting: Alert if error rate exceeds 0.5% over a 5-minute window
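The target and the alert threshold differ (0.1% versus 0.5%), so a window can miss the target without paging anyone. A minimal sketch of that two-level evaluation, with hypothetical function names and status labels:

```python
TARGET_ERROR_RATE = 0.001  # 0.1% target
ALERT_ERROR_RATE = 0.005   # 0.5% alert threshold over a 5-minute window


def error_rate(errors_5xx, total_requests):
    """5xx responses divided by total requests; 0.0 when there is no traffic."""
    if total_requests == 0:
        return 0.0
    return errors_5xx / total_requests


def evaluate_window(errors_5xx, total_requests):
    """Classify one 5-minute window as 'ok', 'over_target', or 'alert'."""
    rate = error_rate(errors_5xx, total_requests)
    if rate > ALERT_ERROR_RATE:
        return "alert"
    if rate >= TARGET_ERROR_RATE:
        return "over_target"
    return "ok"
```

Treating a window with zero requests as healthy is a design choice of this sketch; a real alert rule might instead flag missing traffic separately.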
- Rating Updates: Handle 20M rating updates/day (an average of ~230 updates/second)
- Event Processing: Process Kafka events with minimal lag (< 1 minute)
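The per-second figure is the daily volume divided by the 86,400 seconds in a day, so it describes the mean rate over 24 hours. The sketch below shows the conversion; `peak_factor` is a hypothetical multiplier for sizing above the mean and is not a number taken from this document.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate_per_second(updates_per_day):
    """Mean updates per second implied by a daily volume."""
    return updates_per_day / SECONDS_PER_DAY


def sizing_rate_per_second(updates_per_day, peak_factor=1.0):
    """Rate to provision for, given an assumed peak-to-average ratio."""
    return average_rate_per_second(updates_per_day) * peak_factor
```

For 20,000,000 updates per day the mean works out to roughly 231 updates per second.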
Symptoms:
- `rating_event_processing_lag_seconds` p95 > 60s
- `kafka_consumer_lag` increasing
Solution:
- Check Kafka consumer group status
- Verify database connection pool isn't exhausted
- Review rating calculation performance
- Check for blocking database operations
- Consider scaling out consumer instances
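Whether consumer lag is "increasing" is easier to judge from a trend than from a single reading. One way to quantify it is a least-squares slope over recent `kafka_consumer_lag` samples, sketched below. The function names and the idea of sampling the metric as (timestamp, lag) pairs are assumptions for the example.

```python
def lag_slope(samples):
    """Least-squares slope of lag over time, in messages per second.

    samples: list of (timestamp_seconds, lag_messages) pairs.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_lag = sum(lag for _, lag in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("timestamps must not all be equal")
    cov = sum((t - mean_t) * (lag - mean_lag) for t, lag in samples)
    return cov / var_t


def is_falling_behind(samples, min_slope=0.0):
    """True when lag grows faster than min_slope messages per second."""
    return lag_slope(samples) > min_slope
```

A positive slope that persists across several windows suggests the consumers are not keeping up with the producers, which is the case where scaling out is worth considering.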
Symptoms:
- `rating_update_latency_seconds` p95 > 100ms
- High database query latency
Solution:
- Check database indexes on the `user_ratings` and `rating_ingestions` tables
- Review slow query logs
- Check connection pool configuration
- Verify leaderboard update performance
Symptoms:
- `http_errors_total` or `kafka_event_processing_errors_total` increasing
- Error rate > 0.1%
Solution:
- Check logs for error patterns (filter by trace_id)
- Review database connection errors
- Check for data integrity issues (missing pools, invalid game results)
- Verify idempotency handling (duplicate game_id processing)
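When verifying idempotency, the behavior to look for is that a replayed event with an already-seen `game_id` does not change ratings a second time. The sketch below illustrates that contract with an in-memory set; it is not the service's implementation, which would more plausibly rely on a persistent record or a database uniqueness constraint. The class and callback names are hypothetical.

```python
class IdempotentProcessor:
    """Apply each game result at most once, keyed by game_id (illustrative)."""

    def __init__(self, apply_update):
        self._apply_update = apply_update  # callback that performs the rating update
        self._processed = set()            # in-memory stand-in for a durable store

    def handle(self, event):
        """Return True if the event was applied, False if it was a duplicate."""
        game_id = event["game_id"]
        if game_id in self._processed:
            return False
        self._apply_update(event)
        # Record the id only after a successful update so failures can be retried.
        self._processed.add(game_id)
        return True
```

Feeding the same event twice should invoke the callback once and return `False` on the second call.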