[Expert] Implement circuit breaker, retry with backoff, and graceful degradation for Stellar network calls #6

Description

@DeFiVC

The API awaits calls to Stellar Horizon and Soroban RPC directly, with no retry logic, no circuit breaker, and no graceful degradation. During Stellar network issues, each request stays pending until a timeout fires (60s for transactions; most calls set none), so in-flight requests pile up, holding sockets and memory, until the API stops responding.

Problem Analysis

Current Stellar call paths

All external Stellar calls go through src/stellar/client.ts, src/stellar/transactions.ts, and accounts.ts:

| Method | File:Line | External Call | Timeout |
| --- | --- | --- | --- |
| getAccount | client.ts:33-44 | Horizon API | None |
| submitTransaction | client.ts:46-67 | Horizon API | None |
| callContract | client.ts:70-86 | Soroban RPC | None |
| invokeContract | transactions.ts:15-48 | Soroban RPC (simulate + submit) | 60s tx timeout |
| fundAccount | accounts.ts:22-41 | Horizon API | None |

What happens during Stellar outage

  1. Every reward claim, credential mint, and account funding call hangs
  2. In-flight requests pile up in Fastify, each holding a pending promise and an open socket
  3. New requests queue behind the blocked ones
  4. After ~30s, clients time out and retry, making it worse
  5. After ~60s, the transaction time bounds expire, but the HTTP request may still be pending
  6. The entire API becomes unresponsive

No retry for transient failures

Horizon returns HTTP 502/503 during brief outages. Currently these are treated as permanent failures.
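
Telling transient failures apart from permanent ones could start with a small classifier. The following is a minimal sketch, assuming errors expose an HTTP status at `response.status` (axios-style) or a Node.js socket error `code`; `isTransientStellarError` is a hypothetical name, and the error shape should be checked against what the Stellar SDK actually throws.

```typescript
// Hypothetical helper: decide whether a failed Stellar call is worth retrying.
// Assumption: HTTP errors carry `response.status`, socket errors carry `code`.
const TRANSIENT_HTTP_STATUSES = new Set([502, 503]);
const TRANSIENT_NODE_CODES = new Set(["ECONNREFUSED", "ETIMEDOUT"]);

interface MaybeHttpError {
  code?: unknown;
  response?: { status?: unknown };
}

function isTransientStellarError(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const e = err as MaybeHttpError;
  // Socket-level failures: the request never got an HTTP response.
  if (typeof e.code === "string" && TRANSIENT_NODE_CODES.has(e.code)) return true;
  // Gateway-level failures: Horizon answered, but with a transient status.
  const status = e.response?.status;
  return typeof status === "number" && TRANSIENT_HTTP_STATUSES.has(status);
}
```

Matching on structured fields rather than message text avoids false positives such as a business error whose message happens to contain "503".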

No circuit breaker

There is no mechanism to short-circuit calls when Stellar is known to be down.
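
For illustration, the short-circuit behaviour fits in a small state machine: closed, open, and half-open. This is a minimal sketch and not how cockatiel implements it; `SimpleCircuitBreaker`, `CircuitOpenError`, and the injectable `now` clock are hypothetical names introduced here.

```typescript
// Hypothetical minimal circuit breaker: closed -> open -> halfOpen -> closed.
type BreakerState = "closed" | "open" | "halfOpen";

class CircuitOpenError extends Error {
  constructor() {
    super("circuit open: call rejected without contacting Stellar");
    this.name = "CircuitOpenError";
  }
}

class SimpleCircuitBreaker {
  private state: BreakerState = "closed";
  private consecutiveFailures = 0;
  private openedAt = 0;
  private readonly threshold: number;       // consecutive failures before opening
  private readonly halfOpenAfterMs: number; // wait before allowing one probe
  private readonly now: () => number;       // injectable clock for tests

  constructor(threshold: number, halfOpenAfterMs: number, now: () => number = Date.now) {
    this.threshold = threshold;
    this.halfOpenAfterMs = halfOpenAfterMs;
    this.now = now;
  }

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (this.now() - this.openedAt < this.halfOpenAfterMs) throw new CircuitOpenError();
      this.state = "halfOpen"; // let exactly one probe through
    } else if (this.state === "halfOpen") {
      throw new CircuitOpenError(); // a probe is already in flight
    }
    try {
      const result = await fn();
      this.state = "closed";
      this.consecutiveFailures = 0;
      return result;
    } catch (err) {
      this.consecutiveFailures += 1;
      // A failed probe, or too many failures in a row, (re)opens the circuit.
      if (this.state === "halfOpen" || this.consecutiveFailures >= this.threshold) {
        this.state = "open";
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}
```

While the circuit is open, `execute` rejects without calling `fn`, so callers fail fast instead of waiting on a timeout.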

Required Implementation

A. Retry with Exponential Backoff

Install cockatiel (or implement manually) for retry policies:

// New file: src/stellar/resilience.ts
import {
  retry,
  circuitBreaker,
  timeout,
  handleWhen,
  ExponentialBackoff,
  ConsecutiveBreaker,
  TimeoutStrategy,
  TaskCancelledError,
} from "cockatiel";

// Only handle network/transient errors, not business errors.
// Matching on message text is fragile; prefer error codes or HTTP status where available.
const transientErrors = handleWhen(
  (err) =>
    err instanceof TaskCancelledError || // thrown by the timeout policies below
    err.message.includes("ECONNREFUSED") ||
    err.message.includes("ETIMEDOUT") ||
    err.message.includes("502") ||
    err.message.includes("503")
);

// Retry policy: up to 3 retries, exponential backoff starting at 500ms
export const stellarRetry = retry(transientErrors, {
  maxAttempts: 3,
  backoff: new ExponentialBackoff({ initialDelay: 500 }),
});

// Circuit breaker: open after 5 consecutive failures, half-open after 30s
export const stellarCircuitBreaker = circuitBreaker(transientErrors, {
  halfOpenAfter: 30_000,
  breaker: new ConsecutiveBreaker(5),
});

// Timeout: 10s for read operations, 30s for write operations
export const readTimeout = timeout(10_000, TimeoutStrategy.Aggressive);
export const writeTimeout = timeout(30_000, TimeoutStrategy.Aggressive);
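
For the "implement manually" route, retry with exponential backoff is short enough to write by hand. The sketch below is one possible shape, not a drop-in replacement for cockatiel; `retryWithBackoff`, `backoffDelay`, and the option names are hypothetical, and the full-jitter randomization is an added assumption intended to keep concurrent clients from retrying in lockstep.

```typescript
// Hypothetical hand-rolled retry with exponential backoff and full jitter.
interface RetryOptions {
  maxRetries: number;     // retries after the first attempt
  initialDelayMs: number; // ceiling for the delay before the first retry
  maxDelayMs: number;     // cap on any single delay
  shouldRetry: (err: unknown) => boolean;
  sleep?: (ms: number) => Promise<void>; // injectable for tests
}

const defaultSleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Delay ceiling before retry number `retryIndex` (0-based): initial * 2^retryIndex, capped.
function backoffDelay(retryIndex: number, initialDelayMs: number, maxDelayMs: number): number {
  return Math.min(maxDelayMs, initialDelayMs * 2 ** retryIndex);
}

async function retryWithBackoff<T>(fn: () => Promise<T>, opts: RetryOptions): Promise<T> {
  const sleep = opts.sleep ?? defaultSleep;
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Give up on permanent errors or once the retry budget is spent.
      if (attempt >= opts.maxRetries || !opts.shouldRetry(err)) throw err;
      const ceiling = backoffDelay(attempt, opts.initialDelayMs, opts.maxDelayMs);
      await sleep(Math.random() * ceiling); // full jitter: uniform in [0, ceiling)
    }
  }
}
```

With `initialDelayMs: 500` and `maxRetries: 3`, the delay ceilings are 500ms, 1000ms, and 2000ms.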

B. Wrap Stellar Client Methods

// Updated src/stellar/client.ts
async submitTransaction(
  txEnvelope: StellarSdk.Transaction | StellarSdk.FeeBumpTransaction
): Promise<StellarSdk.Horizon.SubmitTransactionResponse> {
  return stellarCircuitBreaker.execute(() =>
    stellarRetry.execute(() =>
      writeTimeout.execute(async () => {
        try {
          const result = await this.horizon.submitTransaction(txEnvelope);
          if (result.status === "error") {
            throw new StellarError(result);
          }
          return result;
        } catch (err) {
          logger.error({ err }, "Stellar submitTransaction failed");
          throw err;
        }
      })
    )
  );
}

C. Graceful Degradation

For operations where Stellar availability is non-critical, degrade gracefully:

// In reward.service.ts
import { BrokenCircuitError } from "cockatiel";

async claimReward(userId: string, submissionId: string) {
  // ... validation (assumed to produce `score`) ...

  let txHash: string | null = null;
  let onChainSuccess = false;

  try {
    const result = await stellarClient.invokeContract(...);
    txHash = result.hash;
    onChainSuccess = true;
  } catch (err) {
    // cockatiel throws BrokenCircuitError while the breaker is open
    if (err instanceof BrokenCircuitError) {
      logger.warn({ submissionId }, "Stellar circuit breaker open; queuing reward");
      // Queue for later processing instead of failing
      await this.queueRewardForLater(submissionId, userId, score);
      return { success: true, queued: true, txHash: null };
    }
    throw err;
  }

  // ... DB updates ...
}

D. Background Retry Queue

For rewards that fail due to Stellar unavailability, implement a background queue:

// New file: src/services/retry-queue.ts
// Use Redis list as a simple FIFO queue
// Process with a background worker that retries every 30s
// Max retries: 10, then alert and mark as failed
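
The queue described above might be structured as follows. This is a sketch under stated assumptions: an in-memory array stands in for the Redis list so the example is self-contained, and `QueuedReward`, `RewardQueue`, `InMemoryRewardQueue`, and `processQueueOnce` are hypothetical names. A Redis-backed `RewardQueue` would map `push`/`pop`/`size` onto list commands such as RPUSH, LPOP, and LLEN.

```typescript
// Hypothetical sketch of the background retry queue.
interface QueuedReward {
  submissionId: string;
  userId: string;
  score: number;
  attempts: number; // failed attempts so far
}

interface RewardQueue {
  push(item: QueuedReward): Promise<void>;
  pop(): Promise<QueuedReward | undefined>;
  size(): Promise<number>;
}

// Stand-in for a Redis list; FIFO order via push-to-tail, pop-from-head.
class InMemoryRewardQueue implements RewardQueue {
  private items: QueuedReward[] = [];
  async push(item: QueuedReward): Promise<void> { this.items.push(item); }
  async pop(): Promise<QueuedReward | undefined> { return this.items.shift(); }
  async size(): Promise<number> { return this.items.length; }
}

const MAX_ATTEMPTS = 10;

// One worker pass: try each item that was queued at the start of the pass exactly once.
async function processQueueOnce(
  queue: RewardQueue,
  submit: (item: QueuedReward) => Promise<void>,
  onGiveUp: (item: QueuedReward, err: unknown) => Promise<void>,
): Promise<void> {
  const pending = await queue.size();
  for (let i = 0; i < pending; i++) {
    const item = await queue.pop();
    if (!item) return;
    try {
      await submit(item);
    } catch (err) {
      const attempts = item.attempts + 1;
      if (attempts >= MAX_ATTEMPTS) await onGiveUp(item, err); // alert and mark as failed
      else await queue.push({ ...item, attempts });            // retry on a later pass
    }
  }
}
```

A worker could call `processQueueOnce` from a `setInterval` every 30s. Note that popping before submitting means an item is lost if the process crashes mid-submit, so a production version would likely need a more durable hand-off.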

E. Request Timeout on Fastify

// In server.ts
const app = Fastify({
  // ...
  connectionTimeout: 15_000, // 15s socket inactivity timeout (Node's server.timeout)
  bodyLimit: 1048576,        // 1MB body limit
});

// Advertise the expected timeout on Stellar-dependent endpoints.
// The header is informational only; it does not enforce a timeout.
app.addHook("onSend", async (request, reply) => {
  if (request.url.includes("/rewards/claim") || request.url.includes("/credentials/mint")) {
    reply.header("X-Timeout", "30");
  }
});
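
Since a response header cannot bound handler time on its own, a per-route deadline could be enforced inside the handler. The helper below is a minimal sketch; `withDeadline` and `DeadlineExceededError` are hypothetical names, and the helper only stops waiting for the work, it does not cancel it.

```typescript
// Hypothetical helper: reject if `work` has not settled within `ms`.
class DeadlineExceededError extends Error {
  constructor(ms: number) {
    super(`deadline of ${ms}ms exceeded`);
    this.name = "DeadlineExceededError";
  }
}

function withDeadline<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new DeadlineExceededError(ms)), ms);
  });
  // Whichever settles first wins; always clear the timer so it cannot keep the process alive.
  return Promise.race([work, deadline]).finally(() => clearTimeout(timer));
}
```

A route handler could wrap its Stellar-dependent call in `withDeadline(..., 30_000)` and map `DeadlineExceededError` to a 504 response.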

F. Health Check with Dependency Verification

// Updated health endpoint
app.get("/health", async (request, reply) => {
  const checks = await Promise.allSettled([
    db.execute(sql`SELECT 1`),           // PostgreSQL
    redis.ping(),                         // Redis
    stellarClient.horizon.root(),        // Stellar Horizon
  ]);

  const status = checks.every(c => c.status === "fulfilled") ? "healthy" : "degraded";

  return reply.status(status === "healthy" ? 200 : 503).send({
    status,
    timestamp: new Date().toISOString(),
    uptime: process.uptime(),
    checks: {
      database: checks[0].status === "fulfilled" ? "ok" : "error",
      redis: checks[1].status === "fulfilled" ? "ok" : "error",
      stellar: checks[2].status === "fulfilled" ? "ok" : "error",
    },
  });
});

// Separate liveness probe (always 200 if process is running)
app.get("/health/live", async () => ({ status: "ok" }));

// Readiness probe (200 only if all deps are up)
app.get("/health/ready", async (request, reply) => {
  // ... same as /health but only returns 200 when fully healthy
});
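
Because /health and /health/ready share the same checks, the aggregation could live in one helper, with a per-check timeout so that a hung Horizon call cannot stall the probe itself. This is a sketch; `runHealthChecks`, `CheckFn`, and `HealthReport` are hypothetical names, and the timeout value is an assumption to tune.

```typescript
// Hypothetical helper shared by /health and /health/ready.
type CheckFn = () => Promise<unknown>;

interface HealthReport {
  status: "healthy" | "degraded";
  checks: Record<string, "ok" | "error">;
}

async function runHealthChecks(
  checks: Record<string, CheckFn>,
  timeoutMs: number,
): Promise<HealthReport> {
  const names = Object.keys(checks);
  const results = await Promise.allSettled(
    names.map((name) => {
      let timer: ReturnType<typeof setTimeout> | undefined;
      const timeout = new Promise<never>((_, reject) => {
        timer = setTimeout(() => reject(new Error(`${name} check timed out`)), timeoutMs);
      });
      // A check that hangs counts as failed once its timeout fires.
      return Promise.race([checks[name](), timeout]).finally(() => clearTimeout(timer));
    }),
  );
  const report: HealthReport = { status: "healthy", checks: {} };
  results.forEach((result, i) => {
    const ok = result.status === "fulfilled";
    report.checks[names[i]] = ok ? "ok" : "error";
    if (!ok) report.status = "degraded";
  });
  return report;
}
```

The readiness route would then return 200 only when `report.status === "healthy"` and 503 otherwise.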

Dependencies to Add

npm install cockatiel

Testing Requirements

  • Mock Stellar Horizon to return 502/503 and verify retry attempts
  • Mock Stellar to be completely unavailable and verify circuit breaker opens
  • Verify that after circuit breaker opens, requests fail fast (< 100ms)
  • Verify half-open state allows one probe request through
  • Test the background retry queue processes items when Stellar recovers
  • Load test: 100 concurrent reward claims with Stellar returning 503 — verify no crash
