380 changes: 380 additions & 0 deletions KUBERNETES_SCALING.md
@@ -0,0 +1,380 @@
# Kubernetes High Availability and Scaling Guide

This guide explains how to deploy Memos in a Kubernetes environment with proper session management for horizontal scaling and high availability.

## Description

Up to v0.25.0, Memos had the following limitations when deployed as multiple pods in Kubernetes:

1. **Session Isolation**: Each pod maintained its own in-memory session cache, causing authentication inconsistencies when load balancers directed users to different pods.

2. **SSO Redirect Issues**: OAuth2 authentication flows would fail when:
- User initiated login on Pod A
- OAuth provider redirected back to Pod B
- Pod B couldn't validate the session created by Pod A

3. **Cache Inconsistency**: Session updates on one pod weren't reflected on other pods until cache expiry (10+ minutes).

## Solution Overview

The solution implements a **distributed cache system** with the following features:

- **Redis-backed shared cache** for session synchronization across pods
- **Hybrid cache strategy** with local cache fallback for resilience
- **Event-driven cache invalidation** for real-time consistency
- **Backward compatibility** - works without Redis for single-pod deployments

## Architecture

### Production Architecture with External Services

```
┌────────────────────────────────────────────────────────────┐
│                  Load Balancer (Ingress)                   │
└──────────────┬──────────────┬──────────────┬───────────────┘
               │              │              │
          ┌────▼────┐    ┌────▼────┐    ┌────▼────┐
          │  Pod A  │    │  Pod B  │    │  Pod C  │
          └────┬────┘    └────┬────┘    └────┬────┘
               │              │              │
               └──────────────┼──────────────┘
          ┌───────────────────┼───────────────────┐
          │                   │                   │
┌─────────▼─────────┐         │         ┌─────────▼─────────┐
│    Redis Cache    │         │         │   ReadWriteMany   │
│   (ElastiCache)   │         │         │   Storage (EFS)   │
│    Distributed    │         │         │   Shared Files    │
│     Sessions      │         │         │   & Attachments   │
└───────────────────┘         │         └───────────────────┘
                     ┌────────▼────────┐
                     │   External DB   │
                     │ (RDS/Cloud SQL) │
                     │   Multi-AZ HA   │
                     └─────────────────┘
```

## Configuration

### Environment Variables

Set these environment variables for Redis integration:

```bash
# Required for multi-pod deployments: Redis connection URL
MEMOS_REDIS_URL=redis://redis-service:6379

# Optional: Redis configuration
MEMOS_REDIS_POOL_SIZE=20 # Connection pool size
MEMOS_REDIS_DIAL_TIMEOUT=5s # Connection timeout
MEMOS_REDIS_READ_TIMEOUT=3s # Read timeout
MEMOS_REDIS_WRITE_TIMEOUT=3s # Write timeout
MEMOS_REDIS_KEY_PREFIX=memos # Key prefix for isolation
```
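
For reference, here is a minimal sketch of how these variables could be delivered to the Memos pods through a ConfigMap. The ConfigMap name `memos-config` matches the one created in the cloud examples below; the image tag and replica count are illustrative assumptions, not values from the official manifests.

```yaml
# Sketch: Redis settings injected into every Memos pod via envFrom.
apiVersion: v1
kind: ConfigMap
metadata:
  name: memos-config
data:
  MEMOS_REDIS_URL: "redis://redis-service:6379"
  MEMOS_REDIS_POOL_SIZE: "20"
  MEMOS_REDIS_KEY_PREFIX: "memos"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: memos
spec:
  replicas: 3
  selector:
    matchLabels:
      app: memos
  template:
    metadata:
      labels:
        app: memos
    spec:
      containers:
        - name: memos
          image: neosmemo/memos:latest # pin a specific version in production
          envFrom:
            - configMapRef:
                name: memos-config
```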

### Fallback Behavior

- **Redis Available**: Uses hybrid cache (Redis + local fallback)
- **Redis Unavailable**: Falls back to local-only cache (single pod)
- **Redis Failure**: Gracefully degrades to local cache until Redis recovers

## Deployment Options

### 1. Development/Testing Deployment

For testing with a self-hosted database:

```bash
kubectl apply -f kubernetes-example.yaml
```

This creates:
- Self-hosted PostgreSQL with persistent storage
- Redis deployment with persistence
- Memos deployment with 3 replicas
- ReadWriteMany shared storage
- Load balancer service and ingress
- HorizontalPodAutoscaler

### 2. Production Deployment (Recommended)

For production with managed services:

```bash
# First, set up your managed database and Redis
# Then apply the production configuration:
kubectl apply -f kubernetes-production.yaml
```

This provides:
- **External managed database** (AWS RDS, Google Cloud SQL, Azure Database)
- **External managed Redis** (ElastiCache, Google Memorystore, Azure Cache)
- **ReadWriteMany storage** for shared file access
- **Pod Disruption Budget** for high availability (see the sketch after this list)
- **Network policies** for security
- **Advanced health checks** and graceful shutdown
- **Horizontal Pod Autoscaler** with intelligent scaling
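
A minimal sketch of the Pod Disruption Budget mentioned above, assuming the Memos pods carry the label `app: memos` (as in the Deployment sketch earlier):

```yaml
# Keep at least two Memos pods running through voluntary disruptions
# such as node drains and cluster upgrades.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: memos-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: memos
```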

### 3. Cloud Provider Specific Examples

#### AWS Deployment with RDS and ElastiCache

```bash
# 1. Create RDS PostgreSQL instance
aws rds create-db-instance \
  --db-instance-identifier memos-db \
  --db-instance-class db.t3.medium \
  --engine postgres \
  --master-username memos \
  --master-user-password YourSecurePassword \
  --allocated-storage 100 \
  --vpc-security-group-ids sg-xxxxxxxx \
  --db-subnet-group-name memos-subnet-group \
  --multi-az \
  --backup-retention-period 7

# 2. Create ElastiCache Redis cluster
aws elasticache create-replication-group \
  --replication-group-id memos-redis \
  --replication-group-description "Memos Redis cluster" \
  --node-type cache.t3.medium \
  --num-cache-clusters 2 \
  --port 6379

# 3. Update secrets with actual endpoints
kubectl create secret generic memos-secrets \
  --from-literal=database-dsn="postgres://memos:YourSecurePassword@memos-db.xxxxxx.us-east-1.rds.amazonaws.com:5432/memos?sslmode=require"

# 4. Update ConfigMap with ElastiCache endpoint
kubectl create configmap memos-config \
  --from-literal=MEMOS_REDIS_URL="redis://memos-redis.xxxxxx.cache.amazonaws.com:6379"

# 5. Deploy Memos
kubectl apply -f kubernetes-production.yaml
```

#### Google Cloud Deployment

```bash
# 1. Create Cloud SQL instance
gcloud sql instances create memos-db \
  --database-version=POSTGRES_15 \
  --tier=db-n1-standard-2 \
  --region=us-central1 \
  --availability-type=REGIONAL \
  --backup \
  --maintenance-window-day=SUN \
  --maintenance-window-hour=06

# 2. Create Memorystore Redis instance
gcloud redis instances create memos-redis \
  --size=5 \
  --region=us-central1 \
  --redis-version=redis_7_0

# 3. Deploy with Cloud SQL Proxy (secure connection)
kubectl apply -f kubernetes-production.yaml
```

#### Azure Deployment

```bash
# 1. Create Azure Database for PostgreSQL (Flexible Server; the older
#    single-server SKUs such as GP_Gen5_2 do not support PostgreSQL 15)
az postgres flexible-server create \
  --resource-group memos-rg \
  --name memos-db \
  --location eastus \
  --admin-user memos \
  --admin-password YourSecurePassword \
  --tier GeneralPurpose \
  --sku-name Standard_D2s_v3 \
  --version 15

# 2. Create Azure Cache for Redis
az redis create \
  --resource-group memos-rg \
  --name memos-redis \
  --location eastus \
  --sku Standard \
  --vm-size C2

# 3. Deploy Memos
kubectl apply -f kubernetes-production.yaml
```

## Monitoring and Troubleshooting

### Cache Status Endpoint

Monitor cache health via the admin API:

```bash
curl -H "Authorization: Bearer <admin-token>" \
  https://your-memos-instance.com/api/v1/cache/status
```

Response includes:
```json
{
  "user_cache": {
    "type": "hybrid",
    "size": 150,
    "local_size": 45,
    "redis_size": 150,
    "redis_available": true,
    "pod_id": "abc12345",
    "event_queue_size": 0
  },
  "user_setting_cache": {
    "type": "hybrid",
    "size": 89,
    "redis_available": true,
    "pod_id": "abc12345"
  }
}
```

### Health Checks

Monitor these indicators:

1. **Redis Connectivity**: Check `redis_available` in cache status
2. **Event Queue**: Monitor `event_queue_size` for backlog
3. **Cache Hit Rates**: Compare `local_size` vs `redis_size`
4. **Pod Distribution**: Verify requests distributed across pods

### Common Issues

#### Problem: Authentication fails after login

- **Symptoms**: Users can log in but subsequent requests fail
- **Cause**: Session created on one pod, request handled by another
- **Solution**: Verify Redis configuration and connectivity

#### Problem: High cache misses

- **Symptoms**: Poor performance, frequent database queries
- **Cause**: Redis unavailable or misconfigured
- **Solution**: Check Redis logs and connection settings

#### Problem: Session persistence issues

- **Symptoms**: Users logged out unexpectedly
- **Cause**: Redis data loss or TTL issues
- **Solution**: Enable Redis persistence and verify TTL settings

## Performance Considerations

### External Database Requirements

**PostgreSQL Sizing**:
- **Small (< 100 users)**: 2 CPU, 4GB RAM, 100GB storage
- **Medium (100-1000 users)**: 4 CPU, 8GB RAM, 500GB storage
- **Large (1000+ users)**: 8+ CPU, 16GB+ RAM, 1TB+ storage

**Redis Sizing**:
- **Memory**: Base 50MB + (2KB × active sessions) + (1KB × cached settings)
- **Small**: 1GB (handles ~500K sessions)
- **Medium**: 2-4GB (handles 1-2M sessions)
- **Large**: 8GB+ (handles 4M+ sessions)
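
As a quick sanity check on the formula: 500K active sessions alone come to about 1GB (2KB × 500,000) on top of the 50MB base, which is why the small tier is rated at roughly 500K sessions.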

**Connection Pool Sizing**:
- Database: Start with `max_connections = 20 × number_of_pods`
- Redis: Start with `pool_size = 10 × number_of_pods`
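
For example, a three-pod deployment would start with `max_connections = 60` on the database and a combined Redis pool of 30 connections (reading the formula as a cluster-wide total, i.e. roughly 10 connections per pod), then tune from observed utilization.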

### Scaling Guidelines

**Horizontal Pod Autoscaler**:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: memos-hpa
spec:
  scaleTargetRef:
    kind: Deployment
    name: memos
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

**Recommended Scaling**:
- **Small (< 100 users)**: 2-3 pods, managed Redis, managed DB
- **Medium (100-1000 users)**: 3-8 pods, Redis cluster, Multi-AZ DB
- **Large (1000+ users)**: 8-20 pods, Redis cluster, read replicas
- **Enterprise**: 20+ pods, Redis cluster, DB sharding

## Security Considerations

### Redis Security

1. **Network Isolation**: Deploy Redis in private network
2. **Authentication**: Use Redis AUTH if exposed
3. **Encryption**: Enable TLS for Redis connections
4. **Access Control**: Restrict Redis access to Memos pods only (see the NetworkPolicy sketch below)

Example with Redis AUTH (the `rediss://` scheme, which go-redis's URL parser understands, switches the connection to TLS):
```bash
MEMOS_REDIS_URL=redis://:password@redis-service:6379

# TLS variant (assumes the Redis endpoint actually terminates TLS):
MEMOS_REDIS_URL=rediss://:password@redis-service:6379
```
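
For item 4 above, a NetworkPolicy sketch for a self-hosted, in-cluster Redis, assuming the Redis pods are labeled `app: redis` and the Memos pods `app: memos` (managed Redis services are fenced off with security groups or firewall rules instead):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: redis-allow-memos-only
spec:
  podSelector:
    matchLabels:
      app: redis # the pods this policy protects
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: memos # only Memos pods may connect
      ports:
        - protocol: TCP
          port: 6379
```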

### Session Security

- Sessions remain encrypted in transit
- Redis stores serialized session data
- Session TTL honored across all pods
- Admin-only access to cache status endpoint

## Migration Guide

### From Single Pod to Multi-Pod

#### Option 1: Gradual Migration (Recommended)
1. **Setup External Services**: Deploy managed database and Redis
2. **Migrate Data**: Export/import existing database to managed service
3. **Update Configuration**: Add Redis and external DB environment variables
4. **Rolling Update**: Update the Memos deployment with the new config (see the strategy sketch after this list)
5. **Scale Up**: Increase replica count gradually
6. **Verify**: Check cache status and session persistence
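
For step 4, a zero-downtime rollout can be enforced with a surge-based update strategy. A sketch as a strategic-merge patch (the file name is hypothetical), applied with `kubectl patch deployment memos --patch-file rolling-update.yaml`:

```yaml
# rolling-update.yaml: never drop below the desired replica count
# while the new Redis/DB configuration rolls out.
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
```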

#### Option 2: Blue-Green Deployment
1. **Setup New Environment**: Complete production setup in parallel
2. **Data Migration**: Sync data to new environment
3. **DNS Cutover**: Switch traffic to new environment
4. **Cleanup**: Remove old environment after verification

### Rollback Strategy

If issues occur:
1. **Scale Down**: Reduce to a single pod
2. **Remove Redis Config**: Unset the `MEMOS_REDIS_*` environment variables
3. **Restart**: Pods will use the local cache only

## Best Practices

1. **Resource Limits**: Set appropriate CPU/memory limits (see the sketch after this list)
2. **Health Checks**: Implement readiness/liveness probes (included in the same sketch)
3. **Monitoring**: Track cache metrics and Redis health
4. **Backup**: Regular Redis data backups
5. **Testing**: Verify session persistence across pod restarts
6. **Gradual Scaling**: Increase replicas incrementally
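
A pod-template excerpt covering practices 1 and 2, assuming Memos listens on its default port 5230 and exposes a `/healthz` endpoint (verify the health path for your version); the resource numbers are illustrative starting points:

```yaml
containers:
  - name: memos
    image: neosmemo/memos:latest
    ports:
      - containerPort: 5230
    resources:
      requests:
        cpu: 250m
        memory: 256Mi
      limits:
        cpu: "1"
        memory: 1Gi
    readinessProbe: # gate traffic until the pod can serve
      httpGet:
        path: /healthz
        port: 5230
      initialDelaySeconds: 5
      periodSeconds: 10
    livenessProbe: # restart the container if it stops responding
      httpGet:
        path: /healthz
        port: 5230
      initialDelaySeconds: 15
      periodSeconds: 20
```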

## Additional Resources

- [Redis Kubernetes Operator](https://github.com/spotahome/redis-operator)
- [Kubernetes HPA Documentation](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/)
- [Session Affinity vs Distributed Sessions](https://kubernetes.io/docs/concepts/services-networking/service/#session-stickiness)

## Support

For issues or questions:
1. Check cache status endpoint first
2. Review Redis and pod logs
3. Verify environment variable configuration
4. Test with single pod to isolate issues

3 changes: 3 additions & 0 deletions go.mod
@@ -20,6 +20,7 @@ require (
github.com/lib/pq v1.10.9
github.com/lithammer/shortuuid/v4 v4.2.0
github.com/pkg/errors v0.9.1
github.com/redis/go-redis/v9 v9.7.0
github.com/spf13/cobra v1.10.1
github.com/spf13/viper v1.20.1
github.com/stretchr/testify v1.10.0
@@ -38,7 +39,9 @@ require (
filippo.io/edwards25519 v1.1.0 // indirect
github.com/antlr4-go/antlr/v4 v4.13.1 // indirect
github.com/cenkalti/backoff/v4 v4.3.0 // indirect
github.com/cespare/xxhash/v2 v2.3.0 // indirect
github.com/desertbit/timer v1.0.1 // indirect
github.com/dgryski/go-rendezvous v0.0.0-20200823014737-9f7001d12a5f // indirect
github.com/dustin/go-humanize v1.0.1 // indirect
github.com/fsnotify/fsnotify v1.8.0 // indirect
github.com/go-viper/mapstructure/v2 v2.2.1 // indirect