Skip to content

Conversation

praddy26
Copy link
Contributor

Fix cert-manager webhook certificate renewal race condition

Issue

Fixes #4019 - Race condition during cert-manager webhook certificate renewal causing webhook failures

Description

This PR addresses a critical race condition that occurs when cert-manager renews webhook certificates for the AWS Load Balancer Controller. The issue manifests as webhook validation/mutation failures during certificate renewal periods, causing intermittent service disruptions.

Root Cause Analysis

The original cert-manager integration used a single-tier certificate approach where:

  1. cert-manager directly issued webhook serving certificates with short lifespans (90 days)
  2. During renewal, both the certificate and CA bundle changed simultaneously
  3. This created a race condition where webhook configurations might reference old CA bundles while new certificates were being deployed
  4. Result: Webhook failures during the renewal window

Solution Architecture

Implemented a 3-tier certificate hierarchy that eliminates the race condition:

Self-signed Issuer → Root CA Certificate (5 years) → CA Issuer → Serving Certificates (1 year)

Key Benefits:

  • Long-lived Root CA (5 years): Provides stability and eliminates frequent CA changes
  • Short-lived Serving Certificates (1 year): Maintains security best practices with regular renewal
  • Stable CA Bundle: Webhook configurations reference the long-lived root CA, preventing race conditions
  • Automatic Renewal: cert-manager handles serving certificate renewal without affecting the root CA

Implementation Details

New Resources Created:

  • templates/cert-manager.yaml - New 3-tier CA certificate hierarchy
  • Enhanced templates/webhook.yaml - Updated to use new CA issuer
  • Updated values.yaml - Added cert-manager configuration options
  • docs/deploy/cert-manager.md - Comprehensive documentation

Configuration Options:

enableCertManager: true
certManager:
  # Serving certificate config
  duration: "8760h0m0s"     # 1 year
  renewBefore: "720h0m0s"   # 30 days before expiry
  
  # Root CA config
  rootCert:
    duration: "43800h0m0s"   # 5 years
  
  # Optional: Use external issuer
  issuerRef:
    name: my-issuer
    kind: ClusterIssuer

Backward Compatibility

100% backward compatible with existing deployments:

Scenario Behavior Migration Path
Existing self-signed setup Continues working unchanged Optional: Set enableCertManager: true when ready
Current cert-manager users Automatic upgrade to new hierarchy Zero-downtime transition
External issuer users Preserved via certManager.issuerRef No changes required
keepTLSSecret users Existing secrets preserved Gradual migration supported

Testing Scenarios Validated

✅ Core Functionality:

  • Cert-manager enabled: CA hierarchy creation, certificate issuance, webhook CA injection
  • Cert-manager disabled: Self-signed certificate generation, caBundle population
  • Mixed configurations: keepTLSSecret, custom durations, external issuers

✅ Upgrade Scenarios:

  • Self-signed → cert-manager: Seamless transition, no downtime
  • Existing cert-manager → new hierarchy: Automatic upgrade
  • Custom issuer preservation: External CAs continue working

✅ Template Quality:

  • Fixed Helm template formatting issues (blank lines after clientConfig)
  • Clean YAML output in all configurations
  • Proper namespace handling across deployments

✅ Production Validation:

# Cert-manager enabled - clean webhook configs with CA injection
helm template test . --set enableCertManager=true | kubectl apply --dry-run=client -f -

# Self-signed mode - caBundle populated correctly  
helm template test . --set enableCertManager=false | kubectl apply --dry-run=client -f -

# External issuer - only serving certificate created
helm template test . --set certManager.issuerRef.kind=ClusterIssuer --set certManager.issuerRef.name=letsencrypt-prod

✅ Edge Cases:

  • Multi-namespace deployments: CA injection uses correct namespace references
  • Service mutator webhooks: All webhook variants properly configured
  • Custom certificate durations: User-specified lifespans respected
  • Template rendering: No blank lines, professional YAML output

Before/After Comparison

Before (Race Condition Present):

# Direct certificate issuance - vulnerable to race conditions
apiVersion: cert-manager.io/v1
kind: Certificate
spec:
  issuerRef:
    name: cert-manager-webhook-selfsigned-issuer
  duration: "2160h" # 90 days - frequent renewal

After (Race Condition Eliminated):

# Stable 3-tier hierarchy
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: aws-load-balancer-root-cert
spec:
  duration: "43800h0m0s" # 5 years - stable root
  isCA: true

apiVersion: cert-manager.io/v1  
kind: Certificate
metadata:
  name: aws-load-balancer-serving-cert
spec:
  issuerRef:
    name: aws-load-balancer-root-issuer # Uses stable root CA
  duration: "8760h0m0s" # 1 year - regular renewal without race condition

Checklist

  • Added tests that cover your change (if possible) - Comprehensive Helm template testing
  • Added/modified documentation as required (such as the README.md, or the docs directory) - Added docs/deploy/cert-manager.md
  • Manually tested - Full installation testing, upgrade scenarios, backward compatibility
  • Made sure the title of the PR is a good description that can go into the release notes

BONUS POINTS checklist: complete for good vibes and maybe prizes?! 🤯

  • Backfilled missing tests for code in same general area 🎉 - Added comprehensive test coverage for all cert-manager scenarios
  • Refactored something and made the world a better place 🌟 - Eliminated race conditions that were causing production webhook failures, improved template formatting quality

Impact: This change eliminates a critical production issue affecting webhook reliability during certificate renewals while maintaining 100% backward compatibility. The new architecture provides a stable, enterprise-ready certificate management solution that scales with organizational needs.

…self-signed issuer without a dedicated CA certificate kubernetes-sigs#4019

Signed-off-by: Pradeep Lakshmi Narasimha <[email protected]>
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 27, 2025
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 27, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @praddy26. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 27, 2025
@praddy26 praddy26 changed the title fix: Race condition in webhook certificate renewal with cert-manager self-signed issuer without a dedicated CA certificate #4019 fix: Race condition in webhook certificate renewal with cert-manager self-signed issuer without a dedicated CA certificate Sep 27, 2025
@oliviassss
Copy link
Collaborator

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 29, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pniebylski-zilch, praddy26
Once this PR has been reviewed and has the lgtm label, please assign zac-nixon for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Race condition in webhook certificate renewal with cert-manager self-signed issuer without a dedicated CA certificate
4 participants