Skip to content

Latest commit

 

History

History
308 lines (225 loc) · 16.3 KB

File metadata and controls

308 lines (225 loc) · 16.3 KB

PRPilot Incident Postmortem: RSA Key Disposal Bug

Date: May 14, 2026 Author: Austin Chima Severity: Critical — Complete loss of core functionality Duration: ~4 days (May 10 – May 14, 2026) Status: Resolved and deployed


1. Executive Summary

PRPilot is a GitHub App that automatically reviews pull requests using AI (Google Gemini). It runs as a serverless API on Google Cloud Run. For approximately 4 days, the bot was unable to post any review comments on pull requests, despite successfully receiving and processing webhook events.

The root cause was a premature disposal of an RSA cryptographic key in the JWT generation method. This caused an ObjectDisposedException every time the bot attempted to authenticate with GitHub's API, silently killing the review pipeline.

The fix was a 3-line change to the key lifecycle management. After deployment, the bot immediately began posting reviews under its own bot identity (gatemark[bot]).


2. System Architecture (Context You Need to Explain This)

┌──────────────┐     webhook      ┌────────────────┐     enqueue     ┌───────────────────┐
│   GitHub     │ ──────────────►  │  PRPilot API   │ ─────────────►  │ ReviewBackground  │
│  (PR opened) │   POST /api/     │  (Cloud Run)   │   Channel<T>    │ Worker            │
└──────────────┘  github/webhooks └────────────────┘                 └────────┬──────────┘
                                                                              │
                                                          ┌───────────────────┘
                                                          ▼
                                          ┌──────────────────────────┐
                                          │  GitHubAppAuthenticator  │
                                          │                          │
                                          │  1. Generate JWT         │
                                          │  2. Exchange for token   │
                                          │  3. Create GitHub client │
                                          └──────────┬───────────────┘
                                                     │
                                    ┌────────────────┘
                                    ▼
                   ┌─────────────────────────────────┐
                   │  GitHubClientAdapter            │
                   │                                 │
                   │  • GetPullRequestDiffAsync()    │──► GitHub API (read diff)
                   │  • PostPullRequestCommentAsync()│──► GitHub API (write comment)
                   └─────────────────────────────────┘
                                    │
                   diff goes to     │
                                    ▼
                   ┌─────────────────────────────────┐
                   │  GeminiCodeReviewService        │
                   │  • GetReviewAsync(diff)         │──► Gemini API (AI review)
                   └─────────────────────────────────┘

The Authentication Flow (Critical to Understand)

GitHub Apps don't use personal access tokens (PATs). They use a two-step authentication process:

  1. Generate a JWT — The App creates a short-lived JSON Web Token signed with its RSA private key. This JWT proves "I am App #3662925."
  2. Exchange JWT for Installation Token — The App sends the JWT to GitHub's API, which returns a short-lived installation token scoped to the repos where the App is installed.
  3. Use the Installation Token — All API calls (read diffs, post comments) use this token. Comments posted with it appear as gatemark[bot], not as any human user.

3. The Debugging Process (Step by Step)

Phase 1: Understand the Symptom

What we knew:

  • The webhook was reaching Cloud Run (health endpoint responded, logs showed event processing)
  • No review comments were appearing on any PRs
  • The only AI-style review comment ever found (on PR #1) was posted under the user's personal account, not a bot

Key question: Is the system failing silently, or is it not running at all?

How to check: Production logs on Cloud Run. Previous sessions found Octokit.AuthorizationException: Requires authentication in the logs — confirming the system was running but failing at the GitHub API call step.

Tip

Debugging Lesson #1: Always check production logs first. Don't assume the system isn't running. A "silent failure" (the system runs but produces no output) is a different problem than "the system won't start."


Phase 2: Identify the Trust Gap

I examined the test suite and found a critical problem: the tests couldn't prove the App auth worked, because they were testing a completely different auth path.

Test Type Auth Method What It Proves
Unit Tests (PRPilot.UnitTests) Mocked IGitHubAppAuthenticator Business logic works, auth is skipped entirely
Integration Tests (GitHubClientAdapterTests) Personal Access Token (PAT) The adapter can talk to GitHub as the user, not as the bot
Missing GitHub App JWT → Installation Token Nothing tested this path

The integration tests used:

_client.Credentials = new Credentials(_githubToken); // PAT = posts as YOU

So when integration tests passed and posted a comment, it appeared under austinchima — not gatemark[bot]. The passing tests created a false sense of confidence.

Important

Debugging Lesson #2: Verify that your tests actually test the code path that's failing. If your production system uses Auth Method A but your tests use Auth Method B, passing tests tell you nothing about the production failure.


Phase 3: Reproduce the Failure Locally

Instead of guessing, I wrote new integration tests that exercise the actual production auth flow:

// This is what production does — NOT what the old tests did
var authenticator = CreateAuthenticator(); // uses App ID + Private Key
var client = await authenticator.CreateInstallationClientAsync(installationId);
await client.PostPullRequestCommentAsync(repo, prNumber, comment);

First run results:

✅ CreateInstallationClientAsync_ValidCredentials_ReturnsClient  [458 ms]
✅ AppAuth_CanFetchPullRequestDiff                                [635 ms]
❌ AppAuth_CanPostCommentAsBotIdentity                            [4 ms]   ← CRASH
❌ FullPipeline_AppAuth_FetchDiff_PostComment                     [4 ms]   ← CRASH

The error:

System.ObjectDisposedException: Cannot access a disposed object.
Object name: 'System.Security.Cryptography.RSABCrypt'.

Tip

Debugging Lesson #3: If you can't reproduce it, you can't fix it. The single most valuable thing I did was write a test that exercised the exact same code path as production. The bug appeared immediately.


Phase 4: Find the Root Cause

The stack trace pointed directly at the problem:

at System.Security.Cryptography.RSABCrypt.GetKey()
at ...AsymmetricSignatureProvider.Sign(...)
at ...JsonWebTokenHandler.CreateToken(...)
at PRPilot.Infrastructure.GitHubAppAuthenticator.GenerateJwt()    ← HERE

Here's the buggy code:

private string GenerateJwt()
{
    using var rsa = RSA.Create();        // ← PROBLEM: disposed at method exit
    rsa.ImportFromPem(pem);

    var securityKey = new RsaSecurityKey(rsa);  // ← holds a REFERENCE to rsa
    var credentials = new SigningCredentials(securityKey, SecurityAlgorithms.RsaSha256);
    
    var handler = new JsonWebTokenHandler();
    return handler.CreateToken(descriptor);      // ← uses rsa through securityKey
}

Why did the first test pass but the second fail?

This is the subtle part. Each test creates a new GitHubAppAuthenticator instance with a new MemoryCache. So each test calls CreateInstallationClientAsync() which calls GenerateJwt() fresh.

The using var rsa statement means: "dispose this RSA object when the current scope ends." In the GenerateJwt() method, the RSA object gets disposed after handler.CreateToken() returns. So the first call in a fresh process usually works — the token is created before disposal.

But here's the catch: the RsaSecurityKey and SigningCredentials wrap the RSA object by reference, and the underlying cryptographic provider may retain lazy references to the key's native handle. On the second invocation (in the same process), the .NET cryptographic provider pool can reuse internal state from the previous (now-disposed) RSA object, triggering the ObjectDisposedException.

This is a race condition between disposal and the crypto provider's internal caching. It's non-deterministic — sometimes the first call fails too, depending on GC timing and provider pooling.

Call 1: RSA.Create() → Sign → Dispose → ✅ (token was created before disposal finalized)
Call 2: RSA.Create() → Sign → 💥 ObjectDisposedException 
        (crypto provider pool references a disposed handle from Call 1)

Important

Debugging Lesson #4: using statements are not always safe. If an object wraps your resource by reference (not by copy), disposing the original destroys the wrapper's reference too. This is especially dangerous with cryptographic keys, database connections, and streams.


Phase 5: Implement the Fix

The fix ensures the RsaSecurityKey gets its own independent RSA key that nobody else will dispose:

 private string GenerateJwt()
 {
-    using var rsa = RSA.Create();
     var pem = _privateKey.Contains("\\n") ? _privateKey.Replace("\\n", "\n") : _privateKey;
+
+    var rsa = RSA.Create();
     rsa.ImportFromPem(pem);
 
-    var securityKey = new RsaSecurityKey(rsa);
+    // Export the key parameters and dispose the original safely
+    var rsaParameters = rsa.ExportParameters(includePrivateParameters: true);
+    rsa.Dispose();
+
+    // Create a fresh, standalone RSA key from the exported parameters
+    var signingRsa = RSA.Create(rsaParameters);
+    var securityKey = new RsaSecurityKey(signingRsa);
     var credentials = new SigningCredentials(securityKey, SecurityAlgorithms.RsaSha256);

Why this works:

  • rsa.ExportParameters() copies the key material into a plain RSAParameters struct (a value type — no references)
  • We dispose the original rsa safely (we don't need it anymore)
  • RSA.Create(rsaParameters) creates a completely new, independent RSA instance
  • RsaSecurityKey wraps this new instance — and since nothing else holds a reference to it, nothing can dispose it prematurely

Phase 6: Verify the Fix

Re-ran all 4 App auth tests:

✅ CreateInstallationClientAsync_ValidCredentials_ReturnsClient  [458 ms]
✅ AppAuth_CanFetchPullRequestDiff                                [564 ms]
✅ AppAuth_CanPostCommentAsBotIdentity                            [793 ms]  ← FIXED
✅ FullPipeline_AppAuth_FetchDiff_PostComment                     [8 s]    ← FIXED

Verified on GitHub: Two comments appeared on PR #7 from gatemark Bot with the [Bot] badge — one test comment and one full AI code review.

Full test suite:

✅ PRPilot.Core.Tests:           9/9 passed
✅ PRPilot.Infrastructure.Tests: 7/7 passed
✅ PRPilot.UnitTests:            10/10 passed
✅ New App Auth Tests:            4/4 passed
❌ Old PAT Tests:                 2 failed (expired PAT — pre-existing, unrelated)

Phase 7: Deploy and Confirm

  • Built container image via Google Cloud Build
  • Deployed revision prpilot-api-00037-qs4 to Cloud Run
  • Health endpoint confirmed: {"status":"Looking good over here!"}

4. Why This Bug Was Hard to Find

Factor How It Hid the Bug
Wrong test coverage Integration tests used PATs, not App auth — they passed but proved nothing about the failing code path
Silent production failure The ReviewBackgroundWorker caught the exception and logged it, but didn't crash the service. The webhook returned 200 OK to GitHub, so no delivery failures appeared
Non-deterministic timing The first JWT generation sometimes succeeded depending on GC timing, making it look like "auth works but something else is wrong"
Misleading earlier fix A previous session changed AuthenticationType.Oauth to AuthenticationType.Bearer, which was also necessary but masked the deeper RSA disposal bug
No end-to-end test existed Nobody had ever tested: JWT → installation token → post comment in a single test

5. Key Takeaways for Interviews

When explaining this bug:

  1. Start with the architecture — "I built a GitHub App that uses a serverless API on Cloud Run to automatically review PRs using AI."

  2. Describe the symptom — "The bot received webhooks and processed them, but never posted comments. Production logs showed authentication failures."

  3. Explain the debugging methodology:

    • Checked production logs → found the auth exception
    • Audited the test suite → found tests used a different auth path (PATs vs App auth)
    • Wrote targeted integration tests that exercised the real production code path
    • Reproduced the exact error locally on the first run
  4. Explain the root cause — "The RSA private key used for JWT signing was wrapped in a using statement. The RsaSecurityKey class holds a reference to the RSA object, not a copy. When the using block disposed the RSA key, the security key's reference became invalid, causing an ObjectDisposedException on subsequent signing operations."

  5. Explain the fix — "I exported the RSA parameters to a value-type struct, disposed the original key safely, then created a new standalone RSA instance from those parameters. The RsaSecurityKey now owns a key that no other code can dispose."

  6. Emphasize the process improvements:

    • Added 4 new integration tests that cover the real App auth flow
    • These tests now serve as a regression guard — if auth breaks again, CI catches it

Concepts this demonstrates:

  • Object lifecycle management in C# (IDisposable, using, reference vs. value semantics)
  • Cryptographic key management (RSA, JWT signing, key material export)
  • OAuth-style authentication flows (JWT → token exchange → scoped API access)
  • Serverless debugging (Cloud Run logs, silent failures, health checks)
  • Test strategy gaps (unit tests with mocks vs. integration tests that prove real behavior)
  • Systematic debugging (don't guess → reproduce → fix → verify → deploy)

6. Prevention Checklist

  • Integration tests must cover the production auth path — not just a substitute (PAT vs App auth)
  • Avoid using on objects that are wrapped by reference — especially crypto keys, streams, and DB connections
  • Add alerting for background worker failures — the bot should notify (e.g., Slack, email) when it fails to post a review, not just log it
  • End-to-end smoke test after deploy — trigger a test PR automatically after each deployment to confirm the full pipeline works

7. Files Changed

File Change
GitHubAppAuthenticator.cs Fixed RSA key disposal in GenerateJwt()
GitHubAppAuthenticatorTests.cs Added 4 new end-to-end App auth integration tests
PRPilot.IntegrationTests.csproj Added project reference to PRPilot.Infrastructure