Description
The API awaits calls to Stellar Horizon and Soroban RPC directly, with no retry logic, no circuit breaker, and no graceful degradation. Most of these calls set no timeout at all, and the only bound is the 60s transaction timeout in invokeContract. During Stellar network issues, every affected request therefore hangs for 60s or longer, pending requests pile up, and the API stops responding.
Problem Analysis
Current Stellar call paths
All external Stellar calls go through src/stellar/client.ts, src/stellar/transactions.ts, and accounts.ts:
| Method | File:Line | External Call | Timeout |
|---|---|---|---|
| getAccount | client.ts:33-44 | Horizon API | None |
| submitTransaction | client.ts:46-67 | Horizon API | None |
| callContract | client.ts:70-86 | Soroban RPC | None |
| invokeContract | transactions.ts:15-48 | Soroban RPC (simulate + submit) | 60s tx timeout |
| fundAccount | accounts.ts:22-41 | Horizon API | None |
What happens during Stellar outage
- Every reward claim, credential mint, and account funding call hangs
- In-flight requests accumulate as pending promises, each holding a socket and memory
- New requests queue behind the blocked ones
- After ~30s, clients time out and retry, which adds more load
- After ~60s, the transaction time bounds expire, but the HTTP request may still be pending
- The entire API becomes unresponsive
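The hang described above comes from awaiting a promise that may never settle. A minimal sketch of bounding such a call is shown here; `withTimeout` and `TimeoutError` are hypothetical names, not part of the existing codebase or of any library.

```typescript
// Hypothetical error type used to signal that the deadline was hit.
export class TimeoutError extends Error {
  constructor(ms: number) {
    super(`Operation timed out after ${ms}ms`);
    this.name = "TimeoutError";
  }
}

// Hypothetical helper: rejects with TimeoutError if `work` has not settled within `ms`.
export function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(ms)), ms);
  });
  // Whichever settles first wins. The timer is always cleared so it cannot
  // keep the process alive after `work` has finished.
  return Promise.race([work, deadline]).finally(() => clearTimeout(timer));
}
```

Note that this only stops the caller from waiting. The underlying HTTP request keeps running unless it is also cancelled, for example with an abort signal.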
No retry for transient failures
Horizon returns HTTP 502/503 during brief outages. Currently these are treated as permanent failures.
No circuit breaker
There is no mechanism to short-circuit calls when Stellar is known to be down.
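To make the missing mechanism concrete, here is a hand-rolled sketch of a consecutive-failure circuit breaker. `SimpleCircuitBreaker` and `CircuitOpenError` are hypothetical names, and the injectable clock exists only to make the behaviour testable. A library policy would normally replace this.

```typescript
type BreakerState = "closed" | "open" | "half-open";

export class CircuitOpenError extends Error {
  constructor() {
    super("Circuit is open; upstream was not called");
    this.name = "CircuitOpenError";
  }
}

export class SimpleCircuitBreaker {
  private state: BreakerState = "closed";
  private consecutiveFailures = 0;
  private openedAt = 0;
  private readonly threshold: number;
  private readonly halfOpenAfterMs: number;
  private readonly now: () => number;

  constructor(threshold: number, halfOpenAfterMs: number, now: () => number = Date.now) {
    this.threshold = threshold;
    this.halfOpenAfterMs = halfOpenAfterMs;
    this.now = now;
  }

  getState(): BreakerState {
    return this.state;
  }

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    // While a probe is in flight, reject everything else immediately.
    if (this.state === "half-open") throw new CircuitOpenError();
    if (this.state === "open") {
      if (this.now() - this.openedAt < this.halfOpenAfterMs) throw new CircuitOpenError();
      this.state = "half-open"; // cool-down elapsed: let exactly one probe through
    }
    try {
      const result = await fn();
      this.state = "closed";
      this.consecutiveFailures = 0;
      return result;
    } catch (err) {
      this.consecutiveFailures += 1;
      // A failed probe, or too many consecutive failures, (re)opens the circuit.
      if (this.state === "half-open" || this.consecutiveFailures >= this.threshold) {
        this.state = "open";
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}
```

While the circuit is open, `execute` rejects without calling `fn`, which is what lets requests fail fast instead of waiting on a dead upstream.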
Required Implementation
A. Retry with Exponential Backoff
Install cockatiel (or implement manually) for retry policies:
// New file: src/stellar/resilience.ts
import {
  ConsecutiveBreaker,
  ExponentialBackoff,
  TimeoutStrategy,
  circuitBreaker,
  handleWhen,
  retry,
  timeout,
} from "cockatiel";

// Only handle network/transient errors, not business errors
const transientErrors = handleWhen((err) =>
  ["ECONNREFUSED", "ETIMEDOUT", "502", "503"].some((marker) => err.message.includes(marker))
);

// Retry policy: 3 attempts, exponential backoff starting at 500ms
export const stellarRetry = retry(transientErrors, {
  maxAttempts: 3,
  backoff: new ExponentialBackoff({ initialDelay: 500 }),
});

// Circuit breaker: open after 5 consecutive failures, half-open after 30s
export const stellarCircuitBreaker = circuitBreaker(transientErrors, {
  halfOpenAfter: 30_000,
  breaker: new ConsecutiveBreaker(5),
});

// Timeout: 10s for read operations, 30s for write operations
export const readTimeout = timeout(10_000, TimeoutStrategy.Aggressive);
export const writeTimeout = timeout(30_000, TimeoutStrategy.Aggressive);
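For the "implement manually" route, a retry loop with exponential backoff might look like the sketch below. `retryWithBackoff`, `RetryOptions`, and `isTransientStellarError` are hypothetical names. The injectable `sleep` is an assumption added so tests do not have to wait on real timers.

```typescript
export interface RetryOptions {
  maxAttempts: number; // total attempts, including the first call
  initialDelayMs: number; // delay before the second attempt
  isTransient: (err: unknown) => boolean;
  sleep?: (ms: number) => Promise<void>;
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Matches the same error markers used for the circuit breaker above.
export function isTransientStellarError(err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err);
  return ["ECONNREFUSED", "ETIMEDOUT", "502", "503"].some((marker) => message.includes(marker));
}

export async function retryWithBackoff<T>(fn: () => Promise<T>, opts: RetryOptions): Promise<T> {
  const sleep = opts.sleep ?? defaultSleep;
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Give up on the last attempt, and never retry business errors.
      if (attempt >= opts.maxAttempts || !opts.isTransient(err)) throw err;
      // The delay doubles each time: initial, 2x initial, 4x initial, ...
      await sleep(opts.initialDelayMs * 2 ** (attempt - 1));
    }
  }
}
```

With `maxAttempts: 3` and `initialDelayMs: 500`, a call that keeps failing waits 500ms and then 1000ms before the final attempt. Adding random jitter to each delay would help avoid synchronized retries across instances.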
B. Wrap Stellar Client Methods
// Updated src/stellar/client.ts
import { stellarCircuitBreaker, stellarRetry, writeTimeout } from "./resilience";

// Policies apply inside-out: timeout per attempt, then retry, then the breaker,
// so the breaker records one failure only after all retries are exhausted.
async submitTransaction(
  txEnvelope: StellarSdk.Transaction | StellarSdk.FeeBumpTransaction
): Promise<StellarSdk.Horizon.SubmitTransactionResponse> {
  return stellarCircuitBreaker.execute(() =>
    stellarRetry.execute(() =>
      writeTimeout.execute(async () => {
        try {
          const result = await this.horizon.submitTransaction(txEnvelope);
          if (result.status === "error") {
            throw new StellarError(result);
          }
          return result;
        } catch (err) {
          logger.error({ err }, "Stellar submitTransaction failed");
          throw err;
        }
      })
    )
  );
}
C. Graceful Degradation
For operations where Stellar availability is non-critical, degrade gracefully:
// In reward.service.ts
import { BrokenCircuitError } from "cockatiel";

async claimReward(userId: string, submissionId: string) {
  // ... validation ...
  let txHash: string | null = null;
  let onChainSuccess = false;
  try {
    const result = await stellarClient.invokeContract(...);
    txHash = result.hash;
    onChainSuccess = true;
  } catch (err) {
    // cockatiel throws BrokenCircuitError when the breaker is open
    if (err instanceof BrokenCircuitError) {
      logger.warn({ submissionId }, "Stellar circuit breaker open: queuing reward");
      // Queue for later processing instead of failing
      await this.queueRewardForLater(submissionId, userId, score);
      return { success: true, queued: true, txHash: null };
    }
    throw err;
  }
  // ... DB updates ...
}
D. Background Retry Queue
For rewards that fail due to Stellar unavailability, implement a background queue:
// New file: src/services/retry-queue.ts
// Use Redis list as a simple FIFO queue
// Process with a background worker that retries every 30s
// Max retries: 10, then alert and mark as failed
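The queue outlined above could be sketched as follows. This version keeps items in an in-memory array purely for illustration; a real implementation would back `enqueue` and `processOnce` with the Redis list mentioned above. `RewardRetryQueue` and `QueuedReward` are hypothetical names.

```typescript
export interface QueuedReward {
  submissionId: string;
  userId: string;
  score: number;
  attempts: number;
}

export class RewardRetryQueue {
  private readonly items: QueuedReward[] = []; // stand-in for a Redis list
  private readonly failed: QueuedReward[] = []; // items that exhausted their retries
  private readonly maxRetries: number;

  constructor(maxRetries = 10) {
    this.maxRetries = maxRetries;
  }

  enqueue(submissionId: string, userId: string, score: number): void {
    this.items.push({ submissionId, userId, score, attempts: 0 });
  }

  pending(): number {
    return this.items.length;
  }

  failedCount(): number {
    return this.failed.length;
  }

  // One worker pass: tries each item that was queued at the start of the pass, in FIFO order.
  async processOnce(submit: (item: QueuedReward) => Promise<void>): Promise<void> {
    const batch = this.items.splice(0, this.items.length);
    for (const item of batch) {
      try {
        await submit(item);
      } catch {
        item.attempts += 1;
        if (item.attempts >= this.maxRetries) {
          this.failed.push(item); // this is where an alert would be raised
        } else {
          this.items.push(item); // retry on the next pass
        }
      }
    }
  }
}
```

A background worker would then call `processOnce` on a timer, for example every 30s via `setInterval`. Because a queued claim may be submitted more than once, the `submit` callback should be idempotent per `submissionId`.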
E. Request Timeout on Fastify
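// In server.ts
const app = Fastify({
  // ...
  // Socket idle timeout, kept above the 30s write timeout.
  // Fastify has no `hookTimeout` option and no built-in per-handler timeout,
  // so slow handlers must be bounded by the Stellar timeout policies.
  connectionTimeout: 35_000,
  bodyLimit: 1_048_576, // 1MB body limit
});
// Advertise the expected upper bound for Stellar-dependent endpoints.
// This header is informational only; it does not enforce a timeout.
app.addHook("onSend", async (request, reply) => {
  if (request.url.includes("/rewards/claim") || request.url.includes("/credentials/mint")) {
    reply.header("X-Timeout", "30");
  }
});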
F. Health Check with Dependency Verification
// Updated health endpoint
app.get("/health", async (request, reply) => {
  const checks = await Promise.allSettled([
    db.execute(sql`SELECT 1`), // PostgreSQL
    redis.ping(), // Redis
    stellarClient.horizon.root(), // Stellar Horizon
  ]);
  const status = checks.every(c => c.status === "fulfilled") ? "healthy" : "degraded";
  return reply.status(status === "healthy" ? 200 : 503).send({
    status,
    timestamp: new Date().toISOString(),
    uptime: process.uptime(),
    checks: {
      database: checks[0].status === "fulfilled" ? "ok" : "error",
      redis: checks[1].status === "fulfilled" ? "ok" : "error",
      stellar: checks[2].status === "fulfilled" ? "ok" : "error",
    },
  });
});
// Separate liveness probe (always 200 if process is running)
app.get("/health/live", async () => ({ status: "ok" }));
// Readiness probe (200 only if all deps are up)
app.get("/health/ready", async (request, reply) => {
  // ... same as /health but only returns 200 when fully healthy
});
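Since /health and /health/ready run the same checks, the aggregation could live in one helper. The sketch below is one possible shape; `runHealthChecks` and `HealthReport` are hypothetical names, and each check is passed in as a function so the helper has no dependency on the database, Redis, or Stellar clients.

```typescript
export type CheckResult = "ok" | "error";

export interface HealthReport {
  status: "healthy" | "degraded";
  httpStatus: 200 | 503;
  checks: Record<string, CheckResult>;
}

// Runs all checks concurrently. A check that throws or rejects is reported as
// "error" and does not prevent the other checks from being reported.
export async function runHealthChecks(
  checks: Record<string, () => Promise<unknown>>,
): Promise<HealthReport> {
  const outcomes = await Promise.all(
    Object.entries(checks).map(async ([name, run]) => {
      try {
        await run();
        return [name, "ok"] as const;
      } catch {
        return [name, "error"] as const;
      }
    }),
  );
  const healthy = outcomes.every(([, result]) => result === "ok");
  return {
    status: healthy ? "healthy" : "degraded",
    httpStatus: healthy ? 200 : 503,
    checks: Object.fromEntries(outcomes),
  };
}
```

A route handler would pass something like `{ database: () => db.execute(sql`SELECT 1`), redis: () => redis.ping(), stellar: () => stellarClient.horizon.root() }` and reply with `report.httpStatus`. Wrapping each check in a short timeout would keep the probe itself from hanging during an outage.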
Dependencies to Add
- cockatiel (retry, circuit breaker, and timeout policies)
Testing Requirements
- Mock Stellar Horizon to return 502/503 and verify retry attempts
- Mock Stellar to be completely unavailable and verify circuit breaker opens
- Verify that after circuit breaker opens, requests fail fast (< 100ms)
- Verify half-open state allows one probe request through
- Test the background retry queue processes items when Stellar recovers
- Load test: 100 concurrent reward claims with Stellar returning 503 — verify no crash
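For the retry-related tests, a small stub that fails a fixed number of times before succeeding is often enough. The sketch below is framework-agnostic; `flakyHorizon` is a hypothetical helper, and the error message imitates an HTTP client's 503 failure so that message-based transient checks match it.

```typescript
// Hypothetical test double: rejects with a 503-style error for the first
// `failures` calls, then resolves with `value` on every later call.
export function flakyHorizon<T>(failures: number, value: T) {
  let calls = 0;
  return {
    call: async (): Promise<T> => {
      calls += 1;
      if (calls <= failures) {
        throw new Error("Request failed with status code 503");
      }
      return value;
    },
    // Number of times `call` has been invoked so far.
    callCount: (): number => calls,
  };
}
```

A test can wrap `stub.call` in the retry policy under test and then assert on `stub.callCount()` to verify how many attempts were made.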