fix: runtime cache health check bypassing stale connections #192

Merged
0xbbjoker merged 4 commits into dev from fix/runtime-cache-health-check-key-mismatch
Jan 7, 2026

Conversation

@0xbbjoker
Contributor

@0xbbjoker 0xbbjoker commented Jan 5, 2026

RuntimeCache Health Check Fix

The RuntimeCache was storing the cache key (e.g., "agentId:ws") as entry.agentId, but DbAdapterPool stores adapters under just "agentId". This caused health checks to always return true for webSearch-enabled runtimes because the adapter lookup never found a match.

Solution: RuntimeCache.set() now accepts the actual agentId separately from the cache key, ensuring health checks properly detect and evict runtimes with closed DB connections.
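
As a rough illustration of the fix described above, here is a minimal sketch of a cache whose set() takes the cache key and the plain agentId separately. The simplified UUID alias, the AgentRuntime shape, the HealthChecker interface, and the has() helper are assumptions made for this example; the real definitions live in lib/eliza/runtime-factory.ts and may differ.

```typescript
// Illustrative sketch only, not the project's actual implementation.
type UUID = string; // simplified stand-in for the project's UUID type

interface AgentRuntime {
  name: string; // hypothetical minimal shape
}

interface CachedRuntime {
  runtime: AgentRuntime;
  characterName: string;
  agentId: UUID; // plain agentId, never the ":ws"-suffixed cache key
}

// Hypothetical stand-in for the part of DbAdapterPool the cache relies on.
interface HealthChecker {
  checkHealth(agentId: UUID): Promise<boolean>;
}

class RuntimeCache {
  private entries = new Map<string, CachedRuntime>();

  // cacheKey indexes the Map; actualAgentId is what health checks use.
  set(cacheKey: string, runtime: AgentRuntime, characterName: string, actualAgentId: UUID): void {
    this.entries.set(cacheKey, { runtime, characterName, agentId: actualAgentId });
  }

  has(cacheKey: string): boolean {
    return this.entries.has(cacheKey);
  }

  // Returns the runtime if its DB adapter is healthy; otherwise evicts the entry.
  async getWithHealthCheck(cacheKey: string, dbPool: HealthChecker): Promise<AgentRuntime | null> {
    const entry = this.entries.get(cacheKey);
    if (!entry) return null;
    const healthy = await dbPool.checkHealth(entry.agentId);
    if (!healthy) {
      this.entries.delete(cacheKey);
      return null;
    }
    return entry.runtime;
  }
}
```

With this shape, a runtime cached under "agentId:ws" is still health-checked against the adapter stored under "agentId".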


Note

Restores proper DB health checks for cached runtimes by decoupling cache key (e.g., agentId:ws) from the actual agentId.

  • Changes RuntimeCache.set(cacheKey, runtime, characterName, actualAgentId) to persist the true agentId while using cacheKey for lookup
  • Updates call site in createRuntimeForUser to pass agentId separately and improves cache logging
  • Ensures getWithHealthCheck checks the correct adapter via DbAdapterPool.checkHealth(entry.agentId) and evicts unhealthy runtimes

Written by Cursor Bugbot for commit 929fa5a.

@vercel

vercel Bot commented Jan 5, 2026

| Project | Deployment | Review | Updated (UTC) |
| --- | --- | --- | --- |
| eliza-cloud-v2 | Ready | Preview, Comment | Jan 7, 2026 9:23pm |
| miniapp | Error | | Jan 7, 2026 9:23pm |

@coderabbitai

coderabbitai Bot commented Jan 5, 2026

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.



@claude

claude Bot commented Jan 5, 2026

Code Review - PR #192: Runtime Cache Health Check Fix

Summary

This PR fixes a critical bug where the RuntimeCache was storing the cache key (e.g., "agentId:ws") as entry.agentId, but DbAdapterPool was looking up adapters using just the agentId. This mismatch caused health checks to always return true for webSearch-enabled runtimes, allowing stale DB connections to persist.

Code Quality & Best Practices ✅

Strengths:

  • Clean, focused fix that addresses the root cause
  • Improved logging that includes both the actual agentId and the cache key for debugging
  • Follows TypeScript best practices with proper type annotations
  • Maintains existing code structure and patterns

Observations:

  • The fix correctly separates concerns: cache key vs. agent ID for DB adapter lookup
  • Parameter naming is clear: cacheKey and actualAgentId make the distinction explicit

Correctness ✅

The fix is sound:

  1. RuntimeCache.set() now accepts 4 parameters: cacheKey, runtime, characterName, actualAgentId
  2. Stores actualAgentId in entry.agentId (used by DbAdapterPool.checkHealth())
  3. Uses cacheKey for Map storage (supports webSearch suffix: "agentId:ws")
  4. Call site at lib/eliza/runtime-factory.ts:426 correctly passes both values

Potential Issues & Considerations

1. API Breaking Change (Minor)
The set() method signature changed from 3 to 4 parameters. While this appears to be an internal class, verify:

  • Are there any other call sites to RuntimeCache.set() besides line 426?
  • Should this be considered a patch vs. minor version bump?

2. Type Safety Enhancement Opportunity
Consider making the distinction more explicit:

set(cacheKey: string, runtime: AgentRuntime, characterName: string, actualAgentId: UUID): void {

This could be enhanced with a comment or, even better, with branded types for cache keys if this pattern expands.
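
One way the branded-types idea might look is sketched below. The AgentId and CacheKey brands and the asAgentId, toCacheKey, and describeEntry helpers are hypothetical names introduced for this example, not part of the PR.

```typescript
// Hypothetical branded types: both are strings at runtime, but the compiler
// treats them as distinct, so one cannot be passed where the other is expected.
type AgentId = string & { readonly __brand: "AgentId" };
type CacheKey = string & { readonly __brand: "CacheKey" };

// Marks a raw string as a plain agent ID (no format validation in this sketch).
function asAgentId(raw: string): AgentId {
  return raw as AgentId;
}

// Builds the cache key, adding the ":ws" suffix for webSearch-enabled runtimes.
function toCacheKey(agentId: AgentId, webSearchEnabled: boolean): CacheKey {
  const key: string = webSearchEnabled ? `${agentId}:ws` : agentId;
  return key as CacheKey;
}

// Mirrors the argument roles of set(): swapping the two arguments
// is rejected at compile time because the brands differ.
function describeEntry(cacheKey: CacheKey, actualAgentId: AgentId): string {
  return `${actualAgentId} (key=${cacheKey})`;
}
```

The brands exist only at the type level, so this adds no runtime cost. The trade-off is that every call site has to construct keys through the helper functions.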

3. Cache Invalidation Question
The delete() and has() methods still use agentId parameter names (lines 136, 145). These actually take cache keys. Consider:

  • Renaming the parameter to cacheKey for consistency
  • Or document that these methods accept cache keys, not agent IDs

Example from lines 319-320:

const wasInMemoryBase = runtimeCache.delete(agentId);
const wasInMemoryWs = runtimeCache.delete(`${agentId}:ws`);

This shows delete() accepts cache keys, not agent IDs.

Performance Considerations ✅

No performance concerns:

  • Additional parameter has negligible overhead
  • Cache lookup logic unchanged
  • Health check behavior improved (prevents unnecessary work with stale connections)

Security Concerns ✅

No security issues identified:

  • UUID type safety maintained
  • No injection risks
  • Proper separation of cache key from agent ID prevents potential key confusion attacks

Test Coverage ⚠️

Missing tests:

  • No test files found for RuntimeCache or DbAdapterPool
  • This bug could have been caught with unit tests

Recommendation:
Add tests covering:

  1. RuntimeCache.set() with webSearch suffix ("agentId:ws")
  2. getWithHealthCheck() correctly passing entry.agentId to DbAdapterPool.checkHealth()
  3. Health check eviction when DB connection is stale
  4. Cache key vs. agent ID distinction

Test case example:

test('health check uses actual agentId, not cache key', async () => {
  const agentId = '123-456-789' as UUID;
  const cacheKey = `${agentId}:ws`;
  
  cache.set(cacheKey, mockRuntime, 'TestChar', agentId);
  
  // Should pass agentId to checkHealth, not cacheKey
  await cache.getWithHealthCheck(cacheKey, mockDbPool);
  
  expect(mockDbPool.checkHealth).toHaveBeenCalledWith(agentId);
});

Style & Documentation

Logging improvement:

`[RuntimeCache] Cached runtime: ${characterName} (${actualAgentId}, key=${cacheKey})`

This is excellent for debugging - clearly shows both values.

Documentation opportunity:
Consider adding a comment above the set() method explaining why both values are needed:

/**
 * Cache a runtime with the given cache key.
 * @param cacheKey - Cache key (may include suffix like ":ws")
 * @param actualAgentId - Agent UUID for DB adapter lookup (no suffix)
 */

Recommendation: ✅ APPROVE with suggestions

This is a solid bug fix that correctly addresses the health check bypass issue. The code is clean and the fix is minimal.

Suggested improvements (non-blocking):

  1. Add unit tests to prevent regression
  2. Rename agentId parameter to cacheKey in delete(), has(), and get() methods for consistency
  3. Add JSDoc comment explaining the distinction between cache key and agent ID
  4. Consider adding a validation that actualAgentId doesn't contain the suffix (defensive programming)

Great work identifying and fixing this subtle but important bug! 🎯

@claude

claude Bot commented Jan 6, 2026

Code Review: Runtime Cache Health Check Fix

Summary

This PR correctly fixes a critical bug where health checks were bypassing stale DB connections due to a cache key mismatch between RuntimeCache and DbAdapterPool.

The Bug

Before: When web search was enabled, the cache key would be "agentId:ws", and this compound key was stored in entry.agentId. The health check at line 104 (dbPool.checkHealth(entry.agentId)) would then look up "agentId:ws" in the DbAdapterPool, which stores adapters under plain "agentId". The lookup would find no adapter and return true (meaning "healthy", at lib/eliza/runtime-factory.ts:236), bypassing the health check entirely.

After: The fix separates the cache key (used for Map indexing) from the actual agentId (stored in the entry), ensuring health checks look up the correct adapter.
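
The lookup miss can be reproduced in isolation. The sketch below assumes a pool that treats an unknown key as healthy, as described above. The class name DbAdapterPoolSketch, the register() method, and the isOpen flag are hypothetical and only stand in for the real DbAdapterPool.

```typescript
type UUID = string; // simplified stand-in for the project's UUID type

// Hypothetical stand-in for DbAdapterPool: adapters are keyed by plain agentId.
class DbAdapterPoolSketch {
  private adapters = new Map<string, { isOpen: boolean }>();

  register(agentId: UUID, isOpen: boolean): void {
    this.adapters.set(agentId, { isOpen });
  }

  // An unknown key is reported as healthy, which is what made the bug silent.
  async checkHealth(key: string): Promise<boolean> {
    const adapter = this.adapters.get(key);
    if (!adapter) return true;
    return adapter.isOpen;
  }
}

async function demo(): Promise<{ withCacheKey: boolean; withAgentId: boolean }> {
  const pool = new DbAdapterPoolSketch();
  pool.register("agent-1", false); // adapter exists, but its connection is closed
  return {
    withCacheKey: await pool.checkHealth("agent-1:ws"), // lookup misses, reports healthy
    withAgentId: await pool.checkHealth("agent-1"), // lookup hits, reports unhealthy
  };
}
```

Passing the suffixed key hides the closed connection, while passing the plain agentId exposes it.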


Review Findings

✅ Code Quality & Best Practices

  • Clean separation of concerns: Cache key vs actual agentId is now explicit
  • Improved logging: The debug message now shows both actualAgentId and cacheKey for better observability
  • Consistent naming: actualAgentId parameter name clearly indicates its purpose
  • Minimal changes: Only touches what's necessary to fix the bug

✅ Correctness

The fix is correct:

  1. RuntimeCache.set() now receives both the cache key and the actual agentId
  2. The cache entry stores the real agentId (without :ws suffix)
  3. Health checks properly look up adapters using the clean agentId
  4. Cache retrieval still works via the compound key

⚠️ Potential Issues

1. Parameter Order Inconsistency
The set() method signature is:

set(cacheKey: string, runtime: AgentRuntime, characterName: string, actualAgentId: UUID)

Consider reordering to group related parameters:

set(cacheKey: string, actualAgentId: UUID, runtime: AgentRuntime, characterName: string)

This puts the two ID-related parameters together, making the relationship clearer. However, this is a minor style preference and the current implementation is acceptable.

2. Signature Change - Breaking, but No Unhandled Call Sites
The signature changes from:

set(agentId: string, runtime: AgentRuntime, characterName: string): void

to:

set(cacheKey: string, runtime: AgentRuntime, characterName: string, actualAgentId: UUID): void

This is a breaking change for any code calling this internal method. ✅ However, I verified there's only one call site (lib/eliza/runtime-factory.ts:426), which is updated in this PR, so this is safe.

3. Edge Case: Other Methods Not Updated
The delete() and has() methods at lines 136-147 still take a single agentId parameter. They should receive the cache key (e.g., "agentId:ws"), not the raw agentId.

These methods operate on the Map key, which is the cache key, so the functionality appears correct as long as callers pass the cache key. Only the parameter name agentId is misleading. Consider renaming it to cacheKey for clarity in a follow-up.

✅ Performance

  • No performance regression: The change only adds one additional parameter
  • Same lookup complexity: O(1) Map operations unchanged
  • Health check behavior improved: Now actually performs checks instead of bypassing

✅ Security

  • No security concerns identified
  • The fix actually improves reliability by ensuring stale connections are properly detected and evicted

❌ Test Coverage

Missing: No unit tests for RuntimeCache health check logic. Consider adding tests for:

  1. Health check with clean agentId when cache key has :ws suffix
  2. Verification that stale connections are properly evicted
  3. Cache key vs agentId separation

Example test case:

test('health check uses actualAgentId not cache key', async () => {
  const cache = new RuntimeCache();
  const runtime = createMockRuntime();
  const agentId = 'abc-123' as UUID;
  const cacheKey = 'abc-123:ws';
  
  cache.set(cacheKey, runtime, 'TestAgent', agentId);
  
  // Mock dbPool.checkHealth to verify it receives the clean agentId
  const mockDbPool = {
    checkHealth: jest.fn().mockResolvedValue(true)
  };
  
  await cache.getWithHealthCheck(cacheKey, mockDbPool);
  expect(mockDbPool.checkHealth).toHaveBeenCalledWith(agentId); // Not cacheKey!
});

Recommendation

✅ APPROVE with suggestions

This is a critical bug fix that should be merged. The implementation is correct and solves the stated problem.

Suggested Follow-ups (not blocking):

  1. Add unit tests for the health check logic
  2. Consider renaming agentId parameter in delete()/has() methods to cacheKey for clarity
  3. Add a comment explaining the cache key vs agentId distinction at the CachedRuntime interface

Great catch on this subtle but important bug! 🎯

@claude

claude Bot commented Jan 6, 2026

Code Review - PR #192: Runtime Cache Health Check Fix

Summary

This PR fixes a critical bug where RuntimeCache health checks were bypassing stale database connections. The root cause was a key mismatch: cache keys included suffixes (e.g., "agentId:ws") but DbAdapterPool stored adapters under plain agentId, causing health checks to always return true.


✅ Strengths

  1. Correct Problem Identification: The fix properly addresses the key mismatch between cache keys and adapter pool keys
  2. Minimal Surface Area: Only 11 lines changed, reducing risk of introducing new bugs
  3. Backward Compatible: Changes are internal to the caching layer without breaking external APIs
  4. Improved Observability: Enhanced debug logging now shows both the actualAgentId and cacheKey

🐛 Issues Found

Critical: Missing Type Safety

Location: runtime-factory.ts:129

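The actualAgentId parameter is typed as UUID but there's no runtime validation. If a non-UUID string (for example, a suffixed cache key) is passed, the adapter lookup will miss and the health check will silently pass, reintroducing the bypass this PR fixes.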

Recommendation:

async set(
  cacheKey: string,
  runtime: AgentRuntime,
  characterName: string,
  actualAgentId: UUID,
): Promise<void> {
  // Add validation
  if (!actualAgentId || typeof actualAgentId !== 'string') {
    throw new Error(`Invalid actualAgentId: ${actualAgentId}`);
  }
  // ... rest of implementation
}
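
A stricter check than the typeof test above could verify the format and reject suffixed keys at the same time. The sketch below assumes agent IDs are canonical 8-4-4-4-12 hexadecimal UUIDs, which this thread does not confirm. UUID_PATTERN and assertPlainAgentId are hypothetical names.

```typescript
// Assumption: agent IDs are canonical UUID strings (8-4-4-4-12 hex digits).
const UUID_PATTERN = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Throws if the value is not a bare UUID, which also rejects keys such as "<uuid>:ws".
function assertPlainAgentId(value: string): void {
  if (!UUID_PATTERN.test(value)) {
    throw new Error(`[RuntimeCache] actualAgentId is not a plain UUID: ${value}`);
  }
}
```

Calling this at the top of set() would turn an accidentally passed cache key into an immediate error rather than a silently skipped health check.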

Medium: Inconsistent Cache Key Usage

Locations: runtime-factory.ts:345-346, runtime-factory.ts:370

The invalidateRuntime() and isRuntimeCached() methods still use raw agentId instead of cacheKey, which could cause inconsistencies:

// Current code at line 345-346
const wasInMemoryBase = await runtimeCache.delete(agentId);
const wasInMemoryWs = await runtimeCache.delete(`${agentId}:ws`);

This works but creates duplication. Consider refactoring to use a helper:

private getCacheKeys(agentId: string): string[] {
  return [agentId, `${agentId}:ws`];
}
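
A possible use of such a helper during invalidation is sketched below. It uses a plain Map in place of RuntimeCache, and the invalidateAll function is a hypothetical name for this example.

```typescript
// Lists every cache key variant that may exist for an agent.
function getCacheKeys(agentId: string): string[] {
  return [agentId, `${agentId}:ws`];
}

// Deletes all variants and returns how many entries were actually removed.
function invalidateAll(cache: Map<string, unknown>, agentId: string): number {
  let removed = 0;
  for (const key of getCacheKeys(agentId)) {
    if (cache.delete(key)) removed += 1;
  }
  return removed;
}
```

If a new suffix is introduced later, only getCacheKeys needs to change, and every invalidation path picks it up.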

Low: Parameter Ordering

Location: runtime-factory.ts:125-129

The new signature has cacheKey first and actualAgentId last, which is counterintuitive since agentId is conceptually more fundamental than the cache key.

Suggestion: Consider reordering for better semantics:

async set(
  actualAgentId: UUID,
  cacheKey: string,
  runtime: AgentRuntime,
  characterName: string,
): Promise<void>

Though this is a minor point and not worth the churn if you prefer the current order.


🔍 Code Quality

Good:

  • Clean separation of concerns between cache key and agent ID
  • Descriptive variable names (cacheKey vs actualAgentId)
  • Improved logging for debugging

Concerns:

  • No input validation for the new parameter
  • No comments explaining the subtle distinction between cacheKey and actualAgentId

⚡ Performance

No performance concerns. The changes are purely logical with no additional overhead.


🔒 Security

No security concerns identified. The fix doesn't introduce any new attack vectors.


🧪 Test Coverage

Critical Gap: No tests found for RuntimeCache or this specific health check logic.

Recommendations:

  1. Add unit tests for RuntimeCache.set() with different cache keys
  2. Add integration tests for health check behavior with webSearch-enabled runtimes
  3. Add regression test for the specific bug: cache key "agentId:ws" should resolve to adapter with key "agentId"

Example test case:

describe('RuntimeCache health check', () => {
  it('should find adapter using actualAgentId when cache key has suffix', async () => {
    const agentId = 'test-agent-123' as UUID;
    const cacheKey = `${agentId}:ws`;
    
    await runtimeCache.set(cacheKey, runtime, 'Test', agentId);
    const result = await runtimeCache.getWithHealthCheck(cacheKey, dbPool);
    
    expect(result).toBeDefined();
    expect(dbPool.checkHealth).toHaveBeenCalledWith(agentId); // Not cacheKey!
  });
});

📝 Recommendations

Must Fix Before Merge:

  1. Add input validation for actualAgentId parameter
  2. Add inline comments explaining the cache key vs agent ID distinction
  3. Add basic unit tests (at minimum, test that health check uses actualAgentId)

Nice to Have:

  1. Refactor cache key generation into a helper method
  2. Consider adding TypeScript branded types to prevent passing wrong IDs
  3. Add JSDoc comments to the set() method explaining parameters

🎯 Verdict

Approve with Changes: The core fix is sound and addresses a real bug, but it needs:

  • Input validation
  • Documentation comments
  • Test coverage

The logic is correct, but production code handling critical infrastructure like database connection health should have stronger safeguards.


Generated by Claude Code

@claude

claude Bot commented Jan 7, 2026

Code Review: PR #192 - Runtime Cache Health Check Fix

Summary

This PR fixes a critical bug where health checks were failing to detect unhealthy database connections for cached runtimes. The root cause was a key mismatch: RuntimeCache stored the composite cache key (e.g., "agentId:ws") as entry.agentId, but DbAdapterPool expected just the plain agentId for adapter lookups.

✅ Strengths

  1. Correct Problem Identification: The fix properly addresses the mismatch between cache keys and adapter keys
  2. Clear Parameter Naming: Using cacheKey vs actualAgentId makes the distinction explicit
  3. Improved Logging: The updated debug log now shows both the actual agentId and cache key for better observability
  4. Minimal Changes: The fix is surgical - only 4 lines changed with minimal surface area for new bugs
  5. Backwards Compatible: Existing cache entries will be recreated naturally as they expire

🐛 Potential Issues

1. Type Safety Concern (Minor)

The actualAgentId parameter is typed as UUID, but there's no runtime validation. Consider adding a validation check:

async set(
  cacheKey: string,
  runtime: AgentRuntime,
  characterName: string,
  actualAgentId: UUID,
): Promise<void> {
  if (!actualAgentId) {
    throw new Error('[RuntimeCache] actualAgentId is required');
  }
  // ... rest of implementation
}

2. Cache Invalidation Gap (Low Priority)

Existing cache entries with the old structure (entry.agentId = "agentId:ws") will keep bypassing health checks until they expire naturally. This is acceptable given the 30-minute max age, but worth noting for monitoring.

🔍 Code Quality

Positive:

  • Clear separation of concerns between cache key (for lookup) and agentId (for health checks)
  • Consistent with existing code style
  • Good use of debug logging for troubleshooting

Suggestions:

  • Consider adding JSDoc comments to clarify the distinction between cacheKey and actualAgentId:

/**
 * Caches a runtime instance.
 * @param cacheKey - The composite key for cache lookup (e.g., "agentId" or "agentId:ws")
 * @param runtime - The runtime instance to cache
 * @param characterName - Name of the character for logging
 * @param actualAgentId - The raw agent UUID for DB adapter health checks
 */

🔒 Security

No security concerns identified. The change doesn't introduce any new attack vectors.

⚡ Performance

No performance impact. The change is purely structural and doesn't add any additional operations.

🧪 Test Coverage

Missing Tests:

  • Unit test verifying health check works with webSearch-enabled runtimes ("agentId:ws" keys)
  • Integration test confirming unhealthy adapters are properly evicted
  • Test case for the logging output format

Recommended Test:

describe('RuntimeCache health checks', () => {
  it('should properly check health for webSearch-enabled runtimes', async () => {
    const agentId = 'test-agent-id' as UUID;
    const cacheKey = `${agentId}:ws`;
    
    await runtimeCache.set(cacheKey, mockRuntime, 'TestChar', agentId);
    
    // Mock unhealthy adapter
    dbPool.checkHealth = jest.fn().mockResolvedValue(false);
    
    const result = await runtimeCache.getWithHealthCheck(cacheKey, dbPool);
    expect(result).toBeNull();
    expect(dbPool.checkHealth).toHaveBeenCalledWith(agentId);
  });
});

🎯 Overall Assessment

Verdict: Approve with minor suggestions

This is a solid fix that correctly addresses the root cause. The implementation is clean and minimal. The suggested improvements are non-blocking:

  • Add parameter validation
  • Add JSDoc comments for clarity
  • Add test coverage for the fix

The PR can be merged as-is, with the suggestions implemented in a follow-up if desired.

Impact: High - Fixes a critical bug that prevented proper cleanup of stale database connections.


Review generated by Claude Code

@0xbbjoker 0xbbjoker merged commit 1890f46 into dev Jan 7, 2026
13 of 16 checks passed
@0xbbjoker 0xbbjoker deleted the fix/runtime-cache-health-check-key-mismatch branch January 7, 2026 21:25
lalalune pushed a commit that referenced this pull request Apr 28, 2026