aritrbas

Summary

This adds HTTP-based health check endpoints for the Calico VPP agent, replacing the existing restart-on-timeout behavior with Kubernetes readiness and liveness probes.

Previously, the agent container would restart frequently while waiting for Felix configuration updates. Pods could appear Running even when they were not fully initialized, which made it difficult to distinguish initialization delays from actual failures.

Now, we report initialization status through standard Kubernetes probes, keeping the container running during initialization by marking it as Not Ready. This allows Kubernetes to manage pod lifecycle based on health check status.

Changes

1. New Health Package (calico-vpp-agent/health/)

Created a new package with:

  • health.go: HTTP server with three endpoints:
    • /liveness: Basic health status (for liveness probe)
    • /readiness: Initialization status (for readiness probe)
    • /status: Detailed JSON status (for monitoring/debugging)
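As a rough illustration of how the three endpoints could be wired, here is a minimal sketch using only the standard library. The `Status` type, `NewMux`, and the `probe` helper are hypothetical names, not the PR's actual code. The sketch also assumes that a failing check answers with 503; the description does not state which status codes are used.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// Status is a hypothetical, simplified health state guarded by a mutex.
type Status struct {
	mu      sync.RWMutex
	healthy bool
	ready   bool
	message string
}

// Set replaces the whole state in one step.
func (s *Status) Set(healthy, ready bool, message string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.healthy, s.ready, s.message = healthy, ready, message
}

// snapshot returns a consistent copy of the current state.
func (s *Status) snapshot() (healthy, ready bool, message string) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.healthy, s.ready, s.message
}

// writeCode answers 200 when ok is true and 503 otherwise (assumed codes).
func writeCode(w http.ResponseWriter, ok bool) {
	if ok {
		w.WriteHeader(http.StatusOK)
		return
	}
	w.WriteHeader(http.StatusServiceUnavailable)
}

// NewMux registers the three paths named in the PR description.
func NewMux(s *Status) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/liveness", func(w http.ResponseWriter, _ *http.Request) {
		healthy, _, _ := s.snapshot()
		writeCode(w, healthy)
	})
	mux.HandleFunc("/readiness", func(w http.ResponseWriter, _ *http.Request) {
		_, ready, _ := s.snapshot()
		writeCode(w, ready)
	})
	mux.HandleFunc("/status", func(w http.ResponseWriter, _ *http.Request) {
		healthy, ready, message := s.snapshot()
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(map[string]any{
			"healthy": healthy, "ready": ready, "message": message,
		})
	})
	return mux
}

// probe sends an in-memory GET to the handler and returns the status code.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	s := &Status{}
	mux := NewMux(s)

	s.Set(true, false, "initializing")
	fmt.Println(probe(mux, "/liveness"), probe(mux, "/readiness"))

	s.Set(true, true, "All components initialized")
	fmt.Println(probe(mux, "/liveness"), probe(mux, "/readiness"))
}
```

In a real agent the mux would be served with `http.ListenAndServe` on the configured port; the in-memory `probe` helper just keeps the sketch runnable without opening a socket.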

2. Configuration Changes (config/config.go)

Added healthcheck port configuration:

// HealthCheckPort is the port on which the health check HTTP server listens
// Defaults to 9090
HealthCheckPort *uint32 `json:"healthCheckPort"`

The healthcheck port can be customized via ConfigMap:

  CALICOVPP_INITIAL_CONFIG: |-
    {
      "healthCheckPort": 9090
    }
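The pointer type on `HealthCheckPort` makes it possible to tell "not set" apart from an explicit value. Below is a minimal sketch of how the default could be applied, assuming the initial config is decoded with Go's `encoding/json`. `InitialConfig`, `defaultHealthCheckPort`, and `healthCheckPort` are hypothetical names used only for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// defaultHealthCheckPort mirrors the documented default of 9090.
const defaultHealthCheckPort uint32 = 9090

// InitialConfig is a hypothetical stand-in holding only the new field.
type InitialConfig struct {
	HealthCheckPort *uint32 `json:"healthCheckPort"`
}

// healthCheckPort decodes raw JSON and falls back to the default
// when the field is absent (the pointer stays nil).
func healthCheckPort(raw string) (uint32, error) {
	var cfg InitialConfig
	if err := json.Unmarshal([]byte(raw), &cfg); err != nil {
		return 0, err
	}
	if cfg.HealthCheckPort == nil {
		return defaultHealthCheckPort, nil
	}
	return *cfg.HealthCheckPort, nil
}

func main() {
	for _, raw := range []string{`{}`, `{"healthCheckPort": 9191}`} {
		port, err := healthCheckPort(raw)
		fmt.Println(port, err)
	}
}
```

If the config is parsed this way, a trailing comma after the last member is rejected by `encoding/json`, so the JSON in the ConfigMap has to be strictly valid.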

3. Deployment YAML Changes (yaml/base/calico-vpp-daemonset.yaml)

Added Kubernetes health probes to the agent container:

  startupProbe:
    failureThreshold: 10
    httpGet:
      path: /liveness
      port: 9090
      scheme: HTTP
    initialDelaySeconds: 30
    periodSeconds: 30
    timeoutSeconds: 3

  livenessProbe:
    failureThreshold: 3
    httpGet:
      path: /liveness
      port: 9090
      scheme: HTTP
    initialDelaySeconds: 30
    periodSeconds: 10
    timeoutSeconds: 3

  readinessProbe:
    failureThreshold: 3
    httpGet:
      path: /readiness
      port: 9090
      scheme: HTTP
    initialDelaySeconds: 10
    periodSeconds: 5
    timeoutSeconds: 3
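To get a feel for what these values mean in practice, the sketch below estimates how long each probe can keep failing before Kubernetes acts on it. The formula (initial delay plus `failureThreshold` periods) is a rough upper bound that ignores probe timeouts and scheduling jitter; the `Probe` type and `FailureBudget` method are hypothetical helpers, not part of the PR.

```go
package main

import (
	"fmt"
	"time"
)

// Probe holds the timing fields copied from the daemonset YAML above.
type Probe struct {
	InitialDelay     time.Duration
	Period           time.Duration
	FailureThreshold int
}

// FailureBudget approximates the time a probe may fail continuously
// before its threshold is reached: initialDelay + failureThreshold*period.
func (p Probe) FailureBudget() time.Duration {
	return p.InitialDelay + time.Duration(p.FailureThreshold)*p.Period
}

var (
	startup   = Probe{InitialDelay: 30 * time.Second, Period: 30 * time.Second, FailureThreshold: 10}
	liveness  = Probe{InitialDelay: 30 * time.Second, Period: 10 * time.Second, FailureThreshold: 3}
	readiness = Probe{InitialDelay: 10 * time.Second, Period: 5 * time.Second, FailureThreshold: 3}
)

func main() {
	fmt.Println("startup:  ", startup.FailureBudget())   // 5m30s
	fmt.Println("liveness: ", liveness.FailureBudget())  // 1m0s
	fmt.Println("readiness:", readiness.FailureBudget()) // 25s
}
```

Kubernetes holds back liveness and readiness checks until the startup probe has succeeded, so the startup budget is the one that bounds how long initialization may take before the container is restarted.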

Components Tracked

The health system tracks the initialization of these components:

  1. vpp: VPP connection established
  2. vpp-manager: VPP Manager ready
  3. felix: Felix configuration received
  4. agent: Agent fully initialized and running

Monitoring

The /status endpoint provides detailed information about the health check status. Here is an example status response:

{
  "healthy": true,
  "ready": true,
  "components": {
    "agent": {
      "initialized": true,
      "message": "Agent fully initialized and running",
      "updatedAt": "2024-10-15T22:30:00Z"
    },
    "felix": {
      "initialized": true,
      "message": "Felix config received",
      "updatedAt": "2024-10-15T22:29:45Z"
    },
    "vpp": {
      "initialized": true,
      "message": "VPP connection established",
      "updatedAt": "2024-10-15T22:29:30Z"
    },
    "vpp-manager": {
      "initialized": true,
      "message": "VPP Manager ready",
      "updatedAt": "2024-10-15T22:29:35Z"
    }
  },
  "message": "All components initialized",
  "lastUpdate": "2024-10-15T22:30:00Z"
}
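For monitoring or debugging, a client could decode this response and list the components that are still initializing. The sketch below uses struct tags taken from the field names in the example above; the `StatusResponse` and `Component` types, the two helper functions, and the not-ready `sample` payload are made up for illustration and are not output captured from the agent.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"time"
)

// Component mirrors one entry under "components" in the example response.
type Component struct {
	Initialized bool      `json:"initialized"`
	Message     string    `json:"message"`
	UpdatedAt   time.Time `json:"updatedAt"`
}

// StatusResponse mirrors the top-level fields of the example response.
type StatusResponse struct {
	Healthy    bool                 `json:"healthy"`
	Ready      bool                 `json:"ready"`
	Components map[string]Component `json:"components"`
	Message    string               `json:"message"`
	LastUpdate time.Time            `json:"lastUpdate"`
}

// sample is a hypothetical not-ready payload, shortened to two components.
const sample = `{
  "healthy": true,
  "ready": false,
  "components": {
    "felix": {"initialized": false, "message": "Waiting for Felix configuration", "updatedAt": "2024-10-15T22:29:45Z"},
    "vpp": {"initialized": true, "message": "VPP connection established", "updatedAt": "2024-10-15T22:29:30Z"}
  },
  "message": "Waiting for Felix configuration",
  "lastUpdate": "2024-10-15T22:29:45Z"
}`

// decodeStatus parses a /status body into a StatusResponse.
func decodeStatus(raw []byte) (StatusResponse, error) {
	var s StatusResponse
	err := json.Unmarshal(raw, &s)
	return s, err
}

// pendingComponents returns the names of uninitialized components, sorted.
func pendingComponents(s StatusResponse) []string {
	var out []string
	for name, c := range s.Components {
		if !c.Initialized {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	s, err := decodeStatus([]byte(sample))
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println("ready:", s.Ready, "pending:", pendingComponents(s))
}
```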

@hedibouattour left a comment

Thanks for this change, great!
If I understand correctly, this will never time out and crash? So if Felix doesn't send its config at all, we are just stuck in the NotReady state?

healthServer.MarkAsUnhealthy("Waiting for Felix configuration")
log.Info("Waiting for Felix configuration...")

ticker := time.NewTicker(20 * time.Second)

I think we can consider reducing the interval; 20s might be too long for retries
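Both review points (no upper bound on the wait, and the length of the retry interval) can be explored with a small wait loop. The sketch below is not the PR's code: `waitForConfig`, `simulate`, and `errFelixTimeout` are hypothetical, and the optional deadline via `context` is just one possible answer to the timeout question. The interval is a parameter, so a shorter value than 20s is easy to try.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errFelixTimeout is returned when the context ends before config arrives.
var errFelixTimeout = errors.New("timed out waiting for Felix configuration")

// waitForConfig blocks until received is closed, calling onTick every
// interval, and gives up when ctx is cancelled or its deadline passes.
func waitForConfig(ctx context.Context, received <-chan struct{}, interval time.Duration, onTick func()) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-received:
			return nil
		case <-ticker.C:
			onTick()
		case <-ctx.Done():
			return errFelixTimeout
		}
	}
}

// simulate delivers the config after configAfter and waits at most timeout.
func simulate(configAfter, timeout time.Duration) error {
	received := make(chan struct{})
	go func() {
		time.Sleep(configAfter)
		close(received)
	}()
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return waitForConfig(ctx, received, 5*time.Millisecond, func() {
		fmt.Println("Waiting for Felix configuration...")
	})
}

func main() {
	// Config arrives well before the deadline.
	fmt.Println("result:", simulate(20*time.Millisecond, time.Second))
	// Config never arrives in time, so the wait is abandoned.
	fmt.Println("result:", simulate(time.Second, 30*time.Millisecond))
}
```

Passing `context.Background()` without a deadline would reproduce the wait-forever behaviour described in the question, leaving the pod NotReady until the config shows up.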
