
title	Rating API Operations
service	rating-api
status	active
last_reviewed	2025-12-02
type	operations

Deployment

Containerized via Docker
Exposes port 8013
Requires PostgreSQL
Optional: NATS (JetStream) for event publishing (rating.updated)
Kafka consumer for game-ended events

Config

DATABASE_URL: Postgres DSN (asyncpg)
REQUIRE_AUTH: enable bearer auth for all endpoints
INTERNAL_BEARER_TOKEN: shared token for internal calls
GLICKO_*: engine defaults
OUTBOX_ENABLED: toggle event outbox publisher (default: true)
OUTBOX_NATS_URL: NATS server URL (default: nats://nats:4222)
OUTBOX_PUBLISH_INTERVAL_SEC: polling interval for the outbox publisher, in seconds
KAFKA_BOOTSTRAP_SERVERS: Kafka broker addresses
KAFKA_GAME_EVENTS_TOPIC: Topic name for game events
KAFKA_CONSUMER_ENABLED: Enable/disable Kafka consumer
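
A minimal sketch of how these variables might be read at startup. The `Settings` dataclass and `load_settings` helper are hypothetical names, not taken from the service code. Only the defaults for OUTBOX_ENABLED and OUTBOX_NATS_URL come from the list above; treating DATABASE_URL as mandatory and leaving the other values unset when absent are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(raw: Optional[str], default: Optional[bool]) -> Optional[bool]:
    """Interpret common truthy spellings; fall back to the default when unset."""
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:  # hypothetical container, not the service's actual class
    database_url: str
    require_auth: Optional[bool]
    internal_bearer_token: Optional[str]
    outbox_enabled: Optional[bool]
    outbox_nats_url: str
    kafka_bootstrap_servers: Optional[str]
    kafka_game_events_topic: Optional[str]
    kafka_consumer_enabled: Optional[bool]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    # Assumption: DATABASE_URL is mandatory because PostgreSQL is required.
    if "DATABASE_URL" not in env:
        raise RuntimeError("DATABASE_URL must be set")
    return Settings(
        database_url=env["DATABASE_URL"],
        require_auth=_as_bool(env.get("REQUIRE_AUTH"), None),
        internal_bearer_token=env.get("INTERNAL_BEARER_TOKEN"),
        # Defaults below are the ones documented in the list above.
        outbox_enabled=_as_bool(env.get("OUTBOX_ENABLED"), True),
        outbox_nats_url=env.get("OUTBOX_NATS_URL", "nats://nats:4222"),
        kafka_bootstrap_servers=env.get("KAFKA_BOOTSTRAP_SERVERS"),
        kafka_game_events_topic=env.get("KAFKA_GAME_EVENTS_TOPIC"),
        kafka_consumer_enabled=_as_bool(env.get("KAFKA_CONSUMER_ENABLED"), None),
    )
```

Passing the environment as a mapping keeps the loader easy to exercise in tests without touching the real process environment.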

Health

GET /health - Service liveness
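
A liveness probe could look roughly like the sketch below. The `is_live` helper and the `localhost` host are illustrative assumptions; only the port (8013) and the `/health` path come from this document, and the response body is not specified here, so the sketch checks the status code only.

```python
import urllib.error
import urllib.request
from typing import Callable


def is_live(base_url: str = "http://localhost:8013",
            timeout: float = 2.0,
            opener: Callable[..., object] = urllib.request.urlopen) -> bool:
    """Return True when GET /health answers 200, False on any other outcome."""
    url = base_url.rstrip("/") + "/health"
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP error statuses all count as "not live".
        return False
```

The `opener` parameter exists only so the probe can be tested without a running service.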

Observability

Metrics Endpoint

GET /metrics - Exports metrics in Prometheus format

Key Metrics

HTTP Metrics

http_requests_total - Total HTTP requests by method, endpoint, and status
http_request_latency_seconds - Request latency histogram (used to derive p50, p95, p99)
http_errors_total - HTTP error count by status code

Business Metrics

rating_updates_total (counter) - Total rating updates by pool_id
rating_update_latency_seconds (histogram) - Rating update processing latency
rating_event_processing_lag_seconds (histogram) - Lag between event timestamp and processing time
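
The lag metric above is the difference between an event's own timestamp and the time it is processed. A minimal sketch of that calculation follows; the function name, the ISO 8601 timestamp format, and the choice to clamp negative values (clock skew) to zero are assumptions rather than the service's documented behavior.

```python
from datetime import datetime, timezone
from typing import Optional


def processing_lag_seconds(event_timestamp: str,
                           processed_at: Optional[datetime] = None) -> float:
    """Seconds between an event's own timestamp and the moment it was processed."""
    event_time = datetime.fromisoformat(event_timestamp)
    if event_time.tzinfo is None:
        # Assumption: timestamps without an offset are UTC.
        event_time = event_time.replace(tzinfo=timezone.utc)
    now = processed_at or datetime.now(timezone.utc)
    # Clamp to zero so producer/consumer clock skew never yields a negative lag.
    return max(0.0, (now - event_time).total_seconds())
```

Timestamps are written with an explicit `+00:00` offset in this sketch because `datetime.fromisoformat` only accepts a trailing `Z` from Python 3.11 onward.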

Database Metrics

db_query_duration_seconds - Database query duration by operation type

Kafka Metrics

kafka_events_consumed_total - Kafka events consumed by event type and status
kafka_event_processing_duration_seconds - Event processing duration
kafka_event_processing_errors_total - Event processing errors
kafka_consumer_lag - Kafka consumer lag in messages

Distributed Tracing

OpenTelemetry tracing is enabled and configured. Traces include:

HTTP request/response spans
Database query spans
Kafka event processing spans
Rating calculation spans

Trace IDs are included in structured logs for correlation.

Structured Logging

All logs are structured (JSON format) and include:

Correlation IDs (from request headers)
Trace IDs (from OpenTelemetry)
Timestamp, log level, service name
Contextual information (game_id, user_id, pool_id, etc.)
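
As an illustration of the log shape described above, here is a sketch of a JSON formatter built on the standard `logging` module. The `JsonFormatter` class and the exact key names (`timestamp`, `level`, `service`, `correlation_id`, and so on) are assumptions; the service's real field names may differ.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object (hypothetical field names)."""

    CONTEXT_FIELDS = ("correlation_id", "trace_id", "game_id", "user_id", "pool_id")

    def __init__(self, service_name: str = "rating-api") -> None:
        super().__init__()
        self.service_name = service_name

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service_name,
            "message": record.getMessage(),
        }
        # Context arrives via logger.info(..., extra={...}); copy only fields that are set.
        for field in self.CONTEXT_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                payload[field] = value
        return json.dumps(payload)
```

With this formatter attached to a handler, a call such as `logger.info("rating updated", extra={"game_id": "g-1", "pool_id": "blitz"})` would emit the contextual fields alongside the timestamp, level, and service name.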

Service Level Objectives (SLOs)

Availability

Target: 99.9% uptime (approximately 43 minutes downtime per month)
Measurement: Service liveness endpoint (/health) responding with 200 OK
Alerting: Alert if availability drops below 99.9% over a 30-day window
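
The downtime figure quoted above follows directly from the target and the window length. The short sketch below reproduces that arithmetic; the function names are illustrative, and measuring availability as the share of successful `/health` probes is an assumption about how the measurement is aggregated.

```python
def downtime_budget_minutes(target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability target over a window."""
    return (1.0 - target) * window_days * 24 * 60


def availability(successful_probes: int, total_probes: int) -> float:
    """Fraction of liveness probes that returned 200 OK."""
    if total_probes == 0:
        return 1.0  # assumption: no probes means no observed downtime
    return successful_probes / total_probes


def breaches_slo(successful_probes: int, total_probes: int, target: float = 0.999) -> bool:
    """True when measured availability falls below the target."""
    return availability(successful_probes, total_probes) < target
```

For a 99.9% target over 30 days, `downtime_budget_minutes(0.999)` comes to about 43.2 minutes, matching the figure above.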

Latency

Rating Update Processing: p95 < 100ms, p99 < 200ms
- Measures time from event receipt to rating calculation and persistence
HTTP API Requests: p95 < 200ms, p99 < 500ms
- Measures end-to-end HTTP request latency
Event Processing Lag: p95 < 1 minute
- Measures time between event timestamp and processing time
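
Percentile targets like these are usually evaluated from histogram buckets rather than from raw samples. The sketch below estimates a quantile from cumulative bucket counts by linear interpolation inside the matching bucket, similar in spirit to what PromQL's `histogram_quantile` does. The function name and the bucket boundaries in the usage note are hypothetical; the service's actual bucket layout is not documented here.

```python
from typing import Sequence, Tuple


def estimate_quantile(q: float, buckets: Sequence[Tuple[float, int]]) -> float:
    """Estimate the q-quantile from (upper_bound_seconds, cumulative_count) pairs.

    Buckets must be sorted by upper bound; the last bound may be float("inf").
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if not buckets:
        raise ValueError("at least one bucket is required")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # Cannot interpolate into the open-ended bucket; report its lower edge.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            # Assume observations are spread evenly within the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound
```

With illustrative buckets `[(0.05, 800), (0.1, 960), (0.2, 995), (float("inf"), 1000)]`, the estimated p95 lands just under 0.1 s and the p99 just under 0.2 s, so both rating-update targets would be met. The accuracy of such estimates depends on how finely the buckets are placed around the thresholds.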

Error Rate

Target: < 0.1% error rate (5xx errors / total requests)
Measurement: Count of 5xx responses divided by total requests
Alerting: Alert if error rate exceeds 0.5% over a 5-minute window
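
The target and the alert threshold above differ on purpose: the 0.1% target describes the long-run objective, while the 0.5% threshold applies to a short 5-minute window. A small sketch of the check, with hypothetical function and constant names:

```python
SLO_TARGET = 0.001        # < 0.1% of requests may end in a 5xx response
ALERT_THRESHOLD = 0.005   # alert when a 5-minute window exceeds 0.5%


def error_rate(server_errors: int, total_requests: int) -> float:
    """5xx responses divided by total requests; 0.0 when there was no traffic."""
    if total_requests == 0:
        return 0.0
    return server_errors / total_requests


def should_alert(server_errors_5m: int, total_requests_5m: int) -> bool:
    """True when the error rate over the last five minutes exceeds the alert threshold."""
    return error_rate(server_errors_5m, total_requests_5m) > ALERT_THRESHOLD
```

Treating a window with no traffic as a 0.0 error rate is an assumption; an alternative is to skip evaluation for empty windows.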

Throughput

Rating Updates: Handle 20M rating updates/day (about 230 updates/second on average; peaks will be higher)
Event Processing: Process Kafka events with minimal lag (< 1 minute)
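
For capacity planning, the daily volume converts to a per-second rate as in the sketch below. The helper name is illustrative; note that dividing a daily total evenly across the day gives an average rate, and real traffic is unlikely to be perfectly flat.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate_per_second(updates_per_day: int) -> float:
    """Average updates per second if the daily volume were spread evenly."""
    return updates_per_day / SECONDS_PER_DAY


rate = average_rate_per_second(20_000_000)
print(f"{rate:.1f} updates/second")
```

For 20M updates per day this works out to roughly 231 updates per second.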

Troubleshooting

Common Issues

Issue: High event processing lag

Symptoms:

rating_event_processing_lag_seconds p95 > 60s
kafka_consumer_lag increasing

Solution:

Check Kafka consumer group status
Verify database connection pool isn't exhausted
Review rating calculation performance
Check for blocking database operations
Consider scaling out consumer instances

Issue: Slow rating updates

Symptoms:

rating_update_latency_seconds p95 > 100ms
High database query latency

Solution:

Check database indexes on user_ratings and rating_ingestions tables
Review slow query logs
Check connection pool configuration
Verify leaderboard update performance

Issue: High error rate

Symptoms:

http_errors_total or kafka_event_processing_errors_total increasing
Error rate > 0.1%

Solution:

Check logs for error patterns (filter by trace_id)
Review database connection errors
Check for data integrity issues (missing pools, invalid game results)
Verify idempotency handling (duplicate game_id processing)
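
When verifying idempotency, it helps to have a picture of what duplicate-safe ingestion looks like. The sketch below uses an in-memory SQLite database as a stand-in for PostgreSQL and records each `game_id` at most once. The table name `rating_ingestions` appears in this document, but its single-column schema here and the `claim_game` helper are assumptions, not the service's actual design.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory stand-in for the ingestion table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE rating_ingestions (game_id TEXT PRIMARY KEY)")
    return conn


def claim_game(conn: sqlite3.Connection, game_id: str) -> bool:
    """Return True the first time a game_id is seen, False for duplicates."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO rating_ingestions (game_id) VALUES (?)",
        (game_id,),
    )
    conn.commit()
    # rowcount is 1 when the row was inserted and 0 when the key already existed.
    return cursor.rowcount == 1


conn = open_store()
for game_id in ["g-1", "g-1", "g-2"]:
    if claim_game(conn, game_id):
        print(f"processing {game_id}")
    else:
        print(f"skipping duplicate {game_id}")
```

In PostgreSQL the comparable statement is `INSERT ... ON CONFLICT DO NOTHING`. Relying on a uniqueness constraint rather than a read-then-write check avoids races between concurrent consumers.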
