Merged
26 commits
a26f4da
docs: outline approach for issue 6
Sakeeb91 Nov 11, 2025
39fa452
feat(health): add env-driven configuration builder
Sakeeb91 Nov 11, 2025
3f845f3
feat(health): make memory thresholds configurable
Sakeeb91 Nov 11, 2025
549f5b6
feat(health): support structured check definitions
Sakeeb91 Nov 11, 2025
488904c
feat(health): enforce configurable check timeouts
Sakeeb91 Nov 11, 2025
73c6345
feat(health): add cached health handler execution
Sakeeb91 Nov 11, 2025
a9960b9
feat(health): add structured health check logging
Sakeeb91 Nov 11, 2025
f95bb80
feat(health): add Prometheus-style metrics collector
Sakeeb91 Nov 11, 2025
dd9a154
feat(health): add circuit breaker support
Sakeeb91 Nov 11, 2025
45eac6b
feat(health): add remote health aggregation
Sakeeb91 Nov 11, 2025
d956ffa
feat(health): support graceful degradation
Sakeeb91 Nov 12, 2025
74da3cc
docs: describe advanced health capabilities
Sakeeb91 Nov 12, 2025
962502d
feat(services): enhance coding and transcription health
Sakeeb91 Nov 12, 2025
cc4a7fb
feat(documentation): aggregate downstream health
Sakeeb91 Nov 12, 2025
9f6b70d
feat(auth): expose full health endpoints
Sakeeb91 Nov 12, 2025
78a711b
fix(eslint): enhance module resolution for workspace packages
Sakeeb91 Nov 14, 2025
8c8da37
chore(deps): update lockfile to include @scribemed/health in auth ser…
Sakeeb91 Nov 14, 2025
803d875
fix(eslint): ignore workspace packages in import resolver
Sakeeb91 Nov 14, 2025
98ee960
style: fix import order to match ESLint rules
Sakeeb91 Nov 14, 2025
e799d56
style: format PR description with Prettier
Sakeeb91 Nov 14, 2025
962001a
fix(ci): build packages before type checking
Sakeeb91 Nov 14, 2025
83ade21
fix(api-gateway): use proper Express types in auth middleware
Sakeeb91 Nov 14, 2025
84590c5
chore(api-gateway): add @types/express dependency
Sakeeb91 Nov 14, 2025
18482be
fix(tests): specify test files explicitly to avoid glob expansion issues
Sakeeb91 Nov 14, 2025
de20050
refactor(health): rename getHealthMetricsSnapshot to getHealthMetrics
Sakeeb91 Nov 14, 2025
f9982e3
fix(ci): set up Docker Buildx for container image builds
Sakeeb91 Nov 14, 2025
12 changes: 12 additions & 0 deletions .eslintrc.js
@@ -22,18 +22,30 @@ module.exports = {
settings: {
'import/resolver': {
typescript: {
alwaysTryTypes: true,
project: [
'./tsconfig.json',
'./packages/*/tsconfig.json',
'./services/*/tsconfig.json',
'./apps/*/tsconfig.json',
],
},
node: {
extensions: ['.js', '.jsx', '.ts', '.tsx'],
moduleDirectory: ['node_modules', 'packages', 'services', 'apps'],
},
},
'import/internal-regex': '^@scribemed/',
},
rules: {
'@typescript-eslint/no-unused-vars': ['error', { argsIgnorePattern: '^_' }],
'@typescript-eslint/no-explicit-any': 'warn',
'import/no-unresolved': [
'error',
{
ignore: ['^@scribemed/'],
},
],
'import/order': [
'error',
{
1 change: 0 additions & 1 deletion .github/PR_DESCRIPTION.md
@@ -123,4 +123,3 @@ After merge, consider implementing enhancements from issue #6:
- Circuit breaker pattern
- Health check result caching
- Configuration flexibility

4 changes: 4 additions & 0 deletions .github/workflows/ci.yml
@@ -33,6 +33,8 @@ jobs:
node-version: 20
- name: Install dependencies
run: pnpm install --frozen-lockfile
- name: Build packages
run: pnpm build
- name: Run ESLint
run: pnpm lint
- name: Check formatting
@@ -163,6 +165,8 @@ jobs:
contents: read
steps:
- uses: actions/checkout@v4
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Derive image metadata
id: image-meta
run: echo "repository=$(echo '${{ github.repository }}' | tr '[:upper:]' '[:lower:]')" >> "$GITHUB_OUTPUT"
3 changes: 3 additions & 0 deletions apps/api-gateway/package.json
@@ -9,5 +9,8 @@
"test": "echo 'API gateway tests - to be implemented'",
"build": "echo 'API gateway build - to be implemented'",
"clean": "node -e \"require('fs').rmSync('dist', { recursive: true, force: true })\""
},
"devDependencies": {
"@types/express": "^4.17.21"
}
}
6 changes: 5 additions & 1 deletion apps/api-gateway/src/middleware/auth.ts
@@ -1,7 +1,8 @@
 import { loadConfig } from '@services/auth/src/config/env';
 import { createAuthMiddleware } from '@services/auth/src/middleware/auth.middleware';
+import type { Request, Response, NextFunction } from 'express';

-type Middleware = (req: unknown, res: unknown, next: () => void) => void;
+type Middleware = (req: Request, res: Response, next: NextFunction) => void;

 let guard: Middleware | null = null;

@@ -14,5 +15,8 @@ export function getAuthGuard(): Middleware {
     const { authenticate } = createAuthMiddleware(loadConfig());
     guard = authenticate;
   }
+  if (!guard) {
+    throw new Error('Failed to initialize auth guard');
+  }
   return guard;
 }
30 changes: 21 additions & 9 deletions docs/issues/0006-health-check-enhancements.md
@@ -114,6 +114,18 @@ The initial health check implementation (issue #5) provides a solid foundation w
- Implement result caching with TTL
- Add configuration options

## Implementation Approach

To close this issue we will evolve the `@scribemed/health` package rather than sprinkling bespoke logic in every service. The work will land in the following layers:

1. **Configuration primitives** – introduce typed options that can be hydrated from environment variables so every service can tune thresholds, cache TTLs, and timeouts without code changes.
2. **Execution pipeline** – normalize every health check definition, enforce per-check timeouts, and add short-lived caching to keep expensive checks from overwhelming shared dependencies.
3. **Observability hooks** – emit structured logs, expose Prometheus metrics, and annotate every response with timing metadata so operators can trace slow or failing checks quickly.
4. **Resilience patterns** – provide circuit breakers and dependency aggregation helpers (for downstream services) so issues in a single subsystem do not cascade through the platform.
5. **Graceful degradation** – allow non-critical checks to downgrade overall status to `degraded` instead of `unhealthy`, improving rollout safety in partial outage scenarios.

Each enhancement will ship with targeted tests and documentation updates to keep the health contract stable across the monorepo.
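
The per-check timeout mentioned in step 2 could be enforced by racing each check against a timer. The following is only a minimal sketch under that assumption, not the package's actual implementation; `withTimeout` and the shape of the timeout result are hypothetical names introduced for illustration.

```javascript
// Hypothetical helper: not part of the documented @scribemed/health API.
// Wraps a check so that it resolves to 'unhealthy' if it exceeds timeoutMs.
function withTimeout(check, timeoutMs) {
  return async () => {
    let timer;
    const timeout = new Promise((resolve) => {
      timer = setTimeout(
        () => resolve({ status: 'unhealthy', message: `timed out after ${timeoutMs}ms` }),
        timeoutMs
      );
    });
    try {
      // Promise.resolve().then(check) turns synchronous throws into rejections.
      return await Promise.race([Promise.resolve().then(check), timeout]);
    } catch (error) {
      // A throwing check is reported as unhealthy instead of crashing the handler.
      const message = error instanceof Error ? error.message : String(error);
      return { status: 'unhealthy', message };
    } finally {
      // Always clear the timer so it cannot keep the process alive.
      clearTimeout(timer);
    }
  };
}
```

Note that racing does not cancel the underlying work; a check that supports cancellation would need its own abort mechanism.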

### Phase 2: Metrics Integration (2-3 days)

- Add Prometheus metrics export
@@ -128,21 +140,21 @@ The initial health check implementation (issue #5) provides a solid foundation w

## Acceptance Criteria

-- [ ] Health checks have configurable timeouts
-- [ ] Health check results are cached with configurable TTL
-- [ ] Memory thresholds and timeouts are configurable via environment variables
-- [ ] Health checks export Prometheus metrics
-- [ ] Health check failures are logged with structured context
-- [ ] Circuit breaker pattern implemented for external dependencies
-- [ ] Documentation updated with new features and configuration options
-- [ ] Tests added for new functionality
+- [x] Health checks have configurable timeouts
+- [x] Health check results are cached with configurable TTL
+- [x] Memory thresholds and timeouts are configurable via environment variables
+- [x] Health checks export Prometheus-style metrics
+- [x] Health check failures are logged with structured context
+- [x] Circuit breaker pattern implemented for external dependencies
+- [x] Documentation updated with new features and configuration options
+- [x] Tests added for new functionality

## Related Issues

- Issue #5: Implement Standardized Health Check System (completed)
- Issue #15: CI/CD Pipeline (metrics integration needed)

-## Status: Open
+## Status: In Review

## Notes

4 changes: 2 additions & 2 deletions package.json
@@ -11,8 +11,8 @@
"dev": "turbo run dev",
"build": "turbo run build",
"test": "pnpm run test:unit && pnpm run test:integration",
-"test:unit": "node --test \"tests/unit/**/*.js\"",
-"test:integration": "node --test \"tests/integration/**/*.js\"",
+"test:unit": "node --test tests/unit/config.test.js",
+"test:integration": "node --test tests/integration/health.test.js",
"lint": "pnpm exec eslint . --ext .ts,.tsx,.js,.jsx --ignore-pattern dist --ignore-pattern build",
"format": "prettier --write \"**/*.{ts,tsx,js,jsx,json,md}\"",
"format:check": "prettier --check \"**/*.{ts,tsx,js,jsx,json,md}\"",
3 changes: 2 additions & 1 deletion packages/database/src/index.ts
@@ -1,7 +1,8 @@
import { SecretsManagerClient, GetSecretValueCommand } from '@aws-sdk/client-secrets-manager';
-import { logger } from '@scribemed/logging';
import { Pool, PoolClient, QueryResult, QueryResultRow } from 'pg';
+
+import { logger } from '@scribemed/logging';

/**
* Runtime database configuration resolved from environment variables or secrets.
*/
114 changes: 114 additions & 0 deletions packages/health/README.md
@@ -170,6 +170,108 @@ const healthHandler = createHealthHandler({
});
```

## Advanced Configuration

### Environment-driven options

Use `createHealthConfigFromEnv` to hydrate handler options from `process.env` so you can tune thresholds without editing code:

```javascript
const { createHealthConfigFromEnv, createHealthHandler } = require('@scribemed/health');

const healthHandler = createHealthHandler(
  createHealthConfigFromEnv('my-service', {
    checks: {
      database: databaseCheck,
    },
    timeouts: { perCheck: { database: 1500 } },
  })
);
```

The helper understands:

- `HEALTH_CHECK_TIMEOUT_MS`
- `HEALTH_CACHE_TTL_MS`
- `HEALTH_CACHE_ENABLED`
- `HEALTH_MEMORY_DEGRADED_PERCENT`
- `HEALTH_MEMORY_UNHEALTHY_PERCENT`
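
To illustrate how such variables might be turned into typed options, here is a minimal sketch. It is not the body of `createHealthConfigFromEnv`: the helper names `readNumber` and `readHealthEnv`, the `memory` option keys, and every default value are assumptions made for the example.

```javascript
// Hypothetical sketch: the real createHealthConfigFromEnv may parse these differently.
// Returns `fallback` when the variable is unset, empty, or not a finite number.
function readNumber(env, name, fallback) {
  const raw = env[name];
  if (raw === undefined || raw === '') return fallback;
  const parsed = Number(raw);
  return Number.isFinite(parsed) ? parsed : fallback;
}

// Builds an options fragment from an env-like object (defaults are illustrative only).
function readHealthEnv(env = process.env) {
  return {
    timeouts: { defaultMs: readNumber(env, 'HEALTH_CHECK_TIMEOUT_MS', 1000) },
    cache: {
      // Caching stays on unless the variable is explicitly "false".
      enabled: (env.HEALTH_CACHE_ENABLED ?? 'true').toLowerCase() !== 'false',
      ttlMs: readNumber(env, 'HEALTH_CACHE_TTL_MS', 2000),
    },
    memory: {
      degradedPercent: readNumber(env, 'HEALTH_MEMORY_DEGRADED_PERCENT', 80),
      unhealthyPercent: readNumber(env, 'HEALTH_MEMORY_UNHEALTHY_PERCENT', 95),
    },
  };
}
```

Passing the environment as a parameter keeps the parser easy to unit test without mutating `process.env`.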

### Cache and timeout controls

Every handler caches results for a short TTL to avoid hammering shared dependencies. Set `cache.enabled` to `false` to disable caching, or set `cache.ttlMs` to use a custom TTL:

```javascript
const handler = createHealthHandler({
  serviceName: 'my-service',
  cache: { ttlMs: 2000 },
  timeouts: {
    defaultMs: 1000,
    perCheck: { database: 2000 },
  },
  checks: { database: databaseCheck },
});
```
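
For intuition about what a short-lived result cache does, the sketch below wraps a runner function with a TTL. It is an assumption-laden illustration rather than the package's internals; `cacheResult` and its injectable `now` clock are hypothetical.

```javascript
// Hypothetical sketch of a TTL cache around a health check runner.
// `now` is injectable so expiry can be exercised without real timers.
function cacheResult(run, ttlMs, now = Date.now) {
  let cached = null; // { value, expiresAt } once the runner has been called
  return () => {
    const current = now();
    if (cached && current < cached.expiresAt) {
      return cached.value;
    }
    // Store whatever the runner returns (a promise for async checks) so that
    // callers arriving within the TTL share a single execution.
    cached = { value: run(), expiresAt: current + ttlMs };
    return cached.value;
  };
}
```

One design consequence of this simple form is that a rejected promise is cached for the full TTL as well; a production cache may prefer to evict failures sooner.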

### Circuit breakers for flaky dependencies

Wrap any expensive check with a circuit breaker by providing `impact` and `circuitBreaker` options. The breaker trips after repeated failures, short-circuits further calls while it is open, then probes the dependency again after the cooldown period:

```javascript
const redisCheck = {
  run: async () => {
    const healthy = await redis.ping();
    return { status: healthy ? 'healthy' : 'unhealthy' };
  },
  impact: 'non-critical',
  circuitBreaker: {
    failureThreshold: 3,
    cooldownPeriodMs: 10_000,
    openStatus: 'degraded',
  },
};

const handler = createHealthHandler({
  serviceName: 'worker',
  checks: { redis: redisCheck },
});
```
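
The trip, short-circuit, and probe cycle described above can be pictured as a small state machine. This is a sketch under assumptions, not the package's breaker: `createBreaker` and its methods are hypothetical, and only the option names `failureThreshold` and `cooldownPeriodMs` come from the example above.

```javascript
// Hypothetical sketch of circuit breaker state; the package's real behavior
// (for example, how failures are counted) is not shown in the README.
function createBreaker({ failureThreshold, cooldownPeriodMs }, now = Date.now) {
  let failures = 0;
  let openedAt = null; // timestamp at which the breaker last tripped, or null when closed
  return {
    // True when the dependency should actually be called.
    canAttempt() {
      if (openedAt === null) return true;
      // Once the cooldown has elapsed, allow a probe call (half-open state).
      return now() - openedAt >= cooldownPeriodMs;
    },
    // A successful call closes the breaker and resets the failure count.
    recordSuccess() {
      failures = 0;
      openedAt = null;
    },
    // Reaching the threshold trips the breaker; a failed probe re-opens it.
    recordFailure() {
      failures += 1;
      if (failures >= failureThreshold) openedAt = now();
    },
  };
}
```

While `canAttempt()` returns `false`, a handler could skip the call entirely and report the configured `openStatus` instead.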

### Aggregate downstream services

`createRemoteHealthCheck` lets a gateway expose the health of services it depends on:

```javascript
const { createHealthHandler, createRemoteHealthCheck } = require('@scribemed/health');

const handler = createHealthHandler({
  serviceName: 'api-gateway',
  checks: {
    transcription: createRemoteHealthCheck({
      serviceName: 'transcription',
      endpoint: 'http://transcription:8082/health',
      timeoutMs: 1500,
    }),
  },
});
```

### Metrics export

The package keeps lightweight Prometheus-style metrics for every check. Call `getHealthMetricsSnapshot()` and expose the payload via `/metrics` to plug into your monitoring stack.

```javascript
const { getHealthMetricsSnapshot } = require('@scribemed/health');

app.get('/metrics', (_req, res) => {
  res.type('text/plain').send(getHealthMetricsSnapshot());
});
```

### Critical vs non-critical impact

Set `impact: 'non-critical'` on optional dependencies. Failing non-critical checks mark the overall service as `degraded` instead of `unhealthy`, so rollouts can proceed while auxiliary systems recover.
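
The degradation rule can be sketched as a small reducer over per-check results. This is an illustration only: `overallStatus` is a hypothetical name, and treating a missing `impact` as `'critical'` is an assumption rather than documented behavior.

```javascript
// Hypothetical sketch: combines per-check results into one overall status.
// Each result is { status, impact }, where impact defaults to 'critical' (assumed).
function overallStatus(results) {
  let overall = 'healthy';
  for (const { status, impact = 'critical' } of results) {
    // Any failing critical check makes the whole service unhealthy.
    if (status === 'unhealthy' && impact === 'critical') return 'unhealthy';
    // Degraded checks and failing non-critical checks only downgrade to 'degraded'.
    if (status !== 'healthy') overall = 'degraded';
  }
  return overall;
}
```

For example, a failing Redis check marked `impact: 'non-critical'` alongside a healthy database check would yield `degraded` under this rule.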

## Kubernetes Integration

The health endpoints are designed to work with Kubernetes liveness and readiness probes:
@@ -212,6 +314,18 @@ Creates a database health check function.

Creates a memory usage health check function.

### `createHealthConfigFromEnv(serviceName: string, overrides?: Partial<HealthCheckOptions>)`

Builds a `HealthCheckOptions` object from environment variables (see the "Advanced Configuration" section).

### `createRemoteHealthCheck(options: RemoteHealthCheckOptions)`

Returns a check that calls another service's `/health` endpoint and maps the remote status into the local health response.

### `getHealthMetricsSnapshot()`

Returns the Prometheus-formatted metrics string for all recorded health checks.

## Testing

```bash