380 changes: 380 additions & 0 deletions KUBERNETES_SCALING.md
@@ -0,0 +1,380 @@
# Kubernetes High Availability and Scaling Guide

This guide explains how to deploy Memos in a Kubernetes environment with proper session management for horizontal scaling and high availability.

## Description

Up to v0.25.0, Memos had the following limitations when deployed as multiple pods in Kubernetes:

1. **Session Isolation**: Each pod maintained its own in-memory session cache, causing authentication inconsistencies when load balancers directed users to different pods.

2. **SSO Redirect Issues**: OAuth2 authentication flows would fail when:
- User initiated login on Pod A
- OAuth provider redirected back to Pod B
- Pod B couldn't validate the session created by Pod A

3. **Cache Inconsistency**: Session updates on one pod weren't reflected on other pods until cache expiry (10+ minutes).

## Solution Overview

The solution implements a **distributed cache system** with the following features:

- **Redis-backed shared cache** for session synchronization across pods
- **Hybrid cache strategy** with local cache fallback for resilience
- **Event-driven cache invalidation** for real-time consistency
- **Backward compatibility** - works without Redis for single-pod deployments

## Architecture

### Production Architecture with External Services

```
┌────────────────────────────────────────────────────────────┐
│                  Load Balancer (Ingress)                   │
└──────────────┬──────────────┬──────────────┬───────────────┘
               │              │              │
          ┌────▼────┐    ┌────▼────┐    ┌────▼────┐
          │  Pod A  │    │  Pod B  │    │  Pod C  │
          └────┬────┘    └────┬────┘    └────┬────┘
               │              │              │
               └──────────────┼──────────────┘
          ┌───────────────────┼───────────────────┐
          │                   │                   │
┌─────────▼─────────┐         │         ┌─────────▼─────────┐
│    Redis Cache    │         │         │   ReadWriteMany   │
│   (ElastiCache)   │         │         │   Storage (EFS)   │
│    Distributed    │         │         │   Shared Files    │
│     Sessions      │         │         │   & Attachments   │
└───────────────────┘         │         └───────────────────┘
                     ┌────────▼────────┐
                     │   External DB   │
                     │ (RDS/Cloud SQL) │
                     │   Multi-AZ HA   │
                     └─────────────────┘
```

## Configuration

### Environment Variables

Set these environment variables for Redis integration:

```bash
# Required for multi-pod deployments: Redis connection URL
MEMOS_REDIS_URL=redis://redis-service:6379

# Optional: Redis configuration
MEMOS_REDIS_POOL_SIZE=20 # Connection pool size
MEMOS_REDIS_DIAL_TIMEOUT=5s # Connection timeout
MEMOS_REDIS_READ_TIMEOUT=3s # Read timeout
MEMOS_REDIS_WRITE_TIMEOUT=3s # Write timeout
MEMOS_REDIS_KEY_PREFIX=memos # Key prefix for isolation
```
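
For reference, here is a minimal sketch of how these variables could be delivered to the Memos pods through a ConfigMap. The ConfigMap name `memos-config` matches the one created in the cloud examples below; the image tag and replica count are illustrative assumptions, not values from the official manifests.

```yaml
# Sketch: Redis settings injected into every Memos pod via envFrom.
apiVersion: v1
kind: ConfigMap
metadata:
  name: memos-config
data:
  MEMOS_REDIS_URL: "redis://redis-service:6379"
  MEMOS_REDIS_POOL_SIZE: "20"
  MEMOS_REDIS_KEY_PREFIX: "memos"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: memos
spec:
  replicas: 3
  selector:
    matchLabels:
      app: memos
  template:
    metadata:
      labels:
        app: memos
    spec:
      containers:
        - name: memos
          image: neosmemo/memos:latest # pin a specific version in production
          envFrom:
            - configMapRef:
                name: memos-config
```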

### Fallback Behavior

- **Redis Available**: Uses hybrid cache (Redis + local fallback)
- **Redis Unavailable**: Falls back to local-only cache (single pod)
- **Redis Failure**: Gracefully degrades to local cache until Redis recovers

## Deployment Options

### 1. Development/Testing Deployment

For testing with a self-hosted database:

```bash
kubectl apply -f kubernetes-example.yaml
```

This creates:
- Self-hosted PostgreSQL with persistent storage
- Redis deployment with persistence
- Memos deployment with 3 replicas
- ReadWriteMany shared storage
- Load balancer service and ingress
- HorizontalPodAutoscaler

### 2. Production Deployment (Recommended)

For production with managed services:

```bash
# First, set up your managed database and Redis
# Then apply the production configuration:
kubectl apply -f kubernetes-production.yaml
```

This provides:
- **External managed database** (AWS RDS, Google Cloud SQL, Azure Database)
- **External managed Redis** (ElastiCache, Google Memorystore, Azure Cache)
- **ReadWriteMany storage** for shared file access
- **Pod Disruption Budget** for high availability (see the sketch after this list)
- **Network policies** for security
- **Advanced health checks** and graceful shutdown
- **Horizontal Pod Autoscaler** with intelligent scaling
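
A minimal sketch of the Pod Disruption Budget mentioned above, assuming the Memos pods carry the label `app: memos` (as in the Deployment sketch earlier):

```yaml
# Keep at least two Memos pods running through voluntary disruptions
# such as node drains and cluster upgrades.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: memos-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: memos
```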

### 3. Cloud Provider Specific Examples

#### AWS Deployment with RDS and ElastiCache

```bash
# 1. Create RDS PostgreSQL instance
aws rds create-db-instance \
  --db-instance-identifier memos-db \
  --db-instance-class db.t3.medium \
  --engine postgres \
  --master-username memos \
  --master-user-password YourSecurePassword \
  --allocated-storage 100 \
  --vpc-security-group-ids sg-xxxxxxxx \
  --db-subnet-group-name memos-subnet-group \
  --multi-az \
  --backup-retention-period 7

# 2. Create ElastiCache Redis cluster
aws elasticache create-replication-group \
  --replication-group-id memos-redis \
  --replication-group-description "Memos Redis cluster" \
  --node-type cache.t3.medium \
  --num-cache-clusters 2 \
  --port 6379

# 3. Update secrets with actual endpoints
kubectl create secret generic memos-secrets \
  --from-literal=database-dsn="postgres://memos:YourSecurePassword@memos-db.xxxxxx.us-east-1.rds.amazonaws.com:5432/memos?sslmode=require"

# 4. Update ConfigMap with ElastiCache endpoint
kubectl create configmap memos-config \
  --from-literal=MEMOS_REDIS_URL="redis://memos-redis.xxxxxx.cache.amazonaws.com:6379"

# 5. Deploy Memos
kubectl apply -f kubernetes-production.yaml
```

#### Google Cloud Deployment

```bash
# 1. Create Cloud SQL instance
gcloud sql instances create memos-db \
  --database-version=POSTGRES_15 \
  --tier=db-n1-standard-2 \
  --region=us-central1 \
  --availability-type=REGIONAL \
  --backup \
  --maintenance-window-day=SUN \
  --maintenance-window-hour=06

# 2. Create Memorystore Redis instance
gcloud redis instances create memos-redis \
  --size=5 \
  --region=us-central1 \
  --redis-version=redis_7_0

# 3. Deploy with Cloud SQL Proxy (secure connection)
kubectl apply -f kubernetes-production.yaml
```

#### Azure Deployment

```bash
# 1. Create Azure Database for PostgreSQL (Flexible Server; the older
#    single-server SKUs such as GP_Gen5_2 do not support PostgreSQL 15)
az postgres flexible-server create \
  --resource-group memos-rg \
  --name memos-db \
  --location eastus \
  --admin-user memos \
  --admin-password YourSecurePassword \
  --tier GeneralPurpose \
  --sku-name Standard_D2s_v3 \
  --version 15

# 2. Create Azure Cache for Redis
az redis create \
  --resource-group memos-rg \
  --name memos-redis \
  --location eastus \
  --sku Standard \
  --vm-size C2

# 3. Deploy Memos
kubectl apply -f kubernetes-production.yaml
```

## Monitoring and Troubleshooting

### Cache Status Endpoint

Monitor cache health via the admin API:

```bash
curl -H "Authorization: Bearer <admin-token>" \
  https://your-memos-instance.com/api/v1/cache/status
```

Response includes:
```json
{
  "user_cache": {
    "type": "hybrid",
    "size": 150,
    "local_size": 45,
    "redis_size": 150,
    "redis_available": true,
    "pod_id": "abc12345",
    "event_queue_size": 0
  },
  "user_setting_cache": {
    "type": "hybrid",
    "size": 89,
    "redis_available": true,
    "pod_id": "abc12345"
  }
}
```

### Health Checks

Monitor these indicators:

1. **Redis Connectivity**: Check `redis_available` in cache status
2. **Event Queue**: Monitor `event_queue_size` for backlog
3. **Cache Hit Rates**: Compare `local_size` vs `redis_size`
4. **Pod Distribution**: Verify requests distributed across pods

### Common Issues

#### Problem: Authentication fails after login

- **Symptoms**: Users can log in but subsequent requests fail
- **Cause**: Session created on one pod, request handled by another
- **Solution**: Verify Redis configuration and connectivity

#### Problem: High cache misses

- **Symptoms**: Poor performance, frequent database queries
- **Cause**: Redis unavailable or misconfigured
- **Solution**: Check Redis logs and connection settings

#### Problem: Session persistence issues

- **Symptoms**: Users logged out unexpectedly
- **Cause**: Redis data loss or TTL issues
- **Solution**: Enable Redis persistence and verify TTL settings

## Performance Considerations

### External Database Requirements

**PostgreSQL Sizing**:
- **Small (< 100 users)**: 2 CPU, 4GB RAM, 100GB storage
- **Medium (100-1000 users)**: 4 CPU, 8GB RAM, 500GB storage
- **Large (1000+ users)**: 8+ CPU, 16GB+ RAM, 1TB+ storage

**Redis Sizing**:
- **Memory**: Base 50MB + (2KB × active sessions) + (1KB × cached settings)
- **Small**: 1GB (handles ~500K sessions)
- **Medium**: 2-4GB (handles 1-2M sessions)
- **Large**: 8GB+ (handles 4M+ sessions)
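
As a quick sanity check on the formula: 500K active sessions alone come to about 1GB (2KB × 500,000) on top of the 50MB base, which is why the small tier is rated at roughly 500K sessions.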

**Connection Pool Sizing**:
- Database: Start with `max_connections = 20 × number_of_pods`
- Redis: Start with `pool_size = 10 × number_of_pods`
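
For example, a three-pod deployment would start with `max_connections = 60` on the database and a combined Redis pool of 30 connections (reading the formula as a cluster-wide total, i.e. roughly 10 connections per pod), then tune from observed utilization.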

### Scaling Guidelines

**Horizontal Pod Autoscaler**:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: memos-hpa
spec:
  scaleTargetRef:
    kind: Deployment
    name: memos
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

**Recommended Scaling**:
- **Small (< 100 users)**: 2-3 pods, managed Redis, managed DB
- **Medium (100-1000 users)**: 3-8 pods, Redis cluster, Multi-AZ DB
- **Large (1000+ users)**: 8-20 pods, Redis cluster, read replicas
- **Enterprise**: 20+ pods, Redis cluster, DB sharding

## Security Considerations

### Redis Security

1. **Network Isolation**: Deploy Redis in private network
2. **Authentication**: Use Redis AUTH if exposed
3. **Encryption**: Enable TLS for Redis connections
4. **Access Control**: Restrict Redis access to Memos pods only (see the NetworkPolicy sketch below)

Example with Redis AUTH (the `rediss://` scheme, which go-redis's URL parser understands, switches the connection to TLS):
```bash
MEMOS_REDIS_URL=redis://:password@redis-service:6379

# TLS variant (assumes the Redis endpoint actually terminates TLS):
MEMOS_REDIS_URL=rediss://:password@redis-service:6379
```
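
For item 4 above, a NetworkPolicy sketch for a self-hosted, in-cluster Redis, assuming the Redis pods are labeled `app: redis` and the Memos pods `app: memos` (managed Redis services are fenced off with security groups or firewall rules instead):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: redis-allow-memos-only
spec:
  podSelector:
    matchLabels:
      app: redis # the pods this policy protects
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: memos # only Memos pods may connect
      ports:
        - protocol: TCP
          port: 6379
```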

### Session Security

- Sessions remain encrypted in transit
- Redis stores serialized session data
- Session TTL honored across all pods
- Admin-only access to cache status endpoint

## Migration Guide

### From Single Pod to Multi-Pod

#### Option 1: Gradual Migration (Recommended)
1. **Setup External Services**: Deploy managed database and Redis
2. **Migrate Data**: Export/import existing database to managed service
3. **Update Configuration**: Add Redis and external DB environment variables
4. **Rolling Update**: Update the Memos deployment with the new config (see the strategy sketch after this list)
5. **Scale Up**: Increase replica count gradually
6. **Verify**: Check cache status and session persistence
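
For step 4, a zero-downtime rollout can be enforced with a surge-based update strategy. A sketch as a strategic-merge patch (the file name is hypothetical), applied with `kubectl patch deployment memos --patch-file rolling-update.yaml`:

```yaml
# rolling-update.yaml: never drop below the desired replica count
# while the new Redis/DB configuration rolls out.
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
```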

#### Option 2: Blue-Green Deployment
1. **Setup New Environment**: Complete production setup in parallel
2. **Data Migration**: Sync data to new environment
3. **DNS Cutover**: Switch traffic to new environment
4. **Cleanup**: Remove old environment after verification

### Rollback Strategy

If issues occur:
1. **Scale Down**: Reduce to a single pod
2. **Remove Redis Config**: Unset the `MEMOS_REDIS_*` environment variables
3. **Restart**: Pods will use the local cache only

## Best Practices

1. **Resource Limits**: Set appropriate CPU/memory limits (see the sketch after this list)
2. **Health Checks**: Implement readiness/liveness probes (included in the same sketch)
3. **Monitoring**: Track cache metrics and Redis health
4. **Backup**: Regular Redis data backups
5. **Testing**: Verify session persistence across pod restarts
6. **Gradual Scaling**: Increase replicas incrementally
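
A pod-template excerpt covering practices 1 and 2, assuming Memos listens on its default port 5230 and exposes a `/healthz` endpoint (verify the health path for your version); the resource numbers are illustrative starting points:

```yaml
containers:
  - name: memos
    image: neosmemo/memos:latest
    ports:
      - containerPort: 5230
    resources:
      requests:
        cpu: 250m
        memory: 256Mi
      limits:
        cpu: "1"
        memory: 1Gi
    readinessProbe: # gate traffic until the pod can serve
      httpGet:
        path: /healthz
        port: 5230
      initialDelaySeconds: 5
      periodSeconds: 10
    livenessProbe: # restart the container if it stops responding
      httpGet:
        path: /healthz
        port: 5230
      initialDelaySeconds: 15
      periodSeconds: 20
```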

## Additional Resources

- [Redis Kubernetes Operator](https://github.com/spotahome/redis-operator)
- [Kubernetes HPA Documentation](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/)
- [Session Affinity vs Distributed Sessions](https://kubernetes.io/docs/concepts/services-networking/service/#session-stickiness)

## Support

For issues or questions:
1. Check cache status endpoint first
2. Review Redis and pod logs
3. Verify environment variable configuration
4. Test with single pod to isolate issues

3 changes: 3 additions & 0 deletions go.mod
@@ -20,6 +20,7 @@ require (
github.com/lib/pq v1.10.9
github.com/lithammer/shortuuid/v4 v4.2.0
github.com/pkg/errors v0.9.1
github.com/redis/go-redis/v9 v9.7.0
github.com/spf13/cobra v1.10.1
github.com/spf13/viper v1.20.1
github.com/stretchr/testify v1.10.0
@@ -38,7 +39,9 @@ require (
filippo.io/edwards25519 v1.1.0 // indirect
github.com/antlr4-go/antlr/v4 v4.13.1 // indirect
github.com/cenkalti/backoff/v4 v4.3.0 // indirect
github.com/cespare/xxhash/v2 v2.3.0 // indirect
github.com/desertbit/timer v1.0.1 // indirect
github.com/dgryski/go-rendezvous v0.0.0-20200823014737-9f7001d12a5f // indirect
github.com/dustin/go-humanize v1.0.1 // indirect
github.com/fsnotify/fsnotify v1.8.0 // indirect
github.com/go-viper/mapstructure/v2 v2.2.1 // indirect