feat: add OpenTelemetry distributed tracing (#252) #360

Open
menawar wants to merge 1 commit into EDOHWARES:main from menawar:feat/issue-252-opentelemetry

menawar commented Apr 27, 2026

Summary

Closes #252.

This PR introduces opt-in OpenTelemetry distributed tracing across the EventHorizon backend so the request flow from API → BullMQ queue → worker can be observed as a single trace in Jaeger, Zipkin, or any OTLP-compatible backend.

The feature is gated behind OTEL_ENABLED=true. With the flag off (the default), the SDK never starts and the helpers fall back to OpenTelemetry's built-in no-op tracer, so existing deployments see no behavioural or performance change until they explicitly opt in.
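
As a rough illustration of the gating described above, the flag check could look like the sketch below. `isTracingEnabled` is a hypothetical name, and the real parsing in backend/src/config/tracing.js may differ; the only behaviour taken from this PR is that the flag defaults to off and is parsed case-insensitively.

```javascript
// Hypothetical helper, not the PR's actual code: decides whether the SDK should start.
function isTracingEnabled(env = process.env) {
  const raw = env.OTEL_ENABLED;
  // Unset (or non-string) means disabled, which is the default.
  if (typeof raw !== 'string') return false;
  // Case-insensitive comparison, so "TRUE" and "True" also enable tracing.
  return raw.trim().toLowerCase() === 'true';
}

console.log(isTracingEnabled({}));                       // false
console.log(isTracingEnabled({ OTEL_ENABLED: 'True' })); // true
```

Treating every value other than `true` as "off" keeps a typo in `.env` from starting the SDK by accident; that strictness is an assumption of this sketch.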

Approach

  • backend/src/config/tracing.js — single SDK initialization point. Must be required before any instrumented library, so it's the first import in server.js. Supports otlp (default), jaeger, console, and none exporters via OTEL_EXPORTER.
  • backend/src/utils/tracing.js — thin wrappers (withSpan, setAttributes, getCurrentTraceId, injectContextIntoCarrier, runWithExtractedContext) that work whether or not the SDK has been started.
  • backend/src/middleware/tracing.middleware.js — annotates the active HTTP span with route/user attributes and sets an x-trace-id response header for log correlation.
  • backend/src/worker/poller.js — wraps each polling cycle and per-contract loop in spans (stellar.poll.cycle, stellar.poll.contract).
  • backend/src/worker/processor.js — wraps job execution in worker.action.execute, extracting trace context from the job payload so the worker's span continues the producer's trace.
  • backend/src/worker/queue.js — injects the active trace context into job data on enqueue.
  • docker-compose.yml — adds an optional jaeger service behind an observability profile (docker compose --profile observability up jaeger).
  • backend/.env.example — documents the new env vars.
  • docs/observability.md — full setup, configuration reference, and span catalog.
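
To make the queue hand-off in the list above concrete, here is a dependency-free sketch of carrying trace identifiers through a job payload. It hand-rolls the W3C `traceparent` string instead of calling the OpenTelemetry propagation API that the real helpers presumably use. `injectTraceContext` and `extractTraceContext` are hypothetical names, and the inner shape of `_traceContext` is assumed.

```javascript
// W3C Trace Context "traceparent" layout: version-traceid-spanid-flags (lowercase hex).
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Producer side: copy the active trace identifiers into the job payload.
// `_traceContext` is the field named in this PR; its inner shape is assumed here.
function injectTraceContext(jobData, { traceId, spanId, sampled = true }) {
  const traceparent = `00-${traceId}-${spanId}-${sampled ? '01' : '00'}`;
  return { ...jobData, _traceContext: { traceparent } };
}

// Worker side: recover the parent identifiers, or null when absent or malformed.
function extractTraceContext(jobData) {
  const header = jobData && jobData._traceContext && jobData._traceContext.traceparent;
  const match = typeof header === 'string' ? TRACEPARENT_RE.exec(header) : null;
  if (!match) return null;
  const [, , traceId, spanId, flags] = match;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

const job = injectTraceContext(
  { contractId: 'example-contract' },
  { traceId: '0af7651916cd43dd8448eb211c80319c', spanId: 'b7ad6b7169203331' }
);
console.log(extractTraceContext(job).traceId); // 0af7651916cd43dd8448eb211c80319c
```

Returning `null` for a missing or malformed carrier lets the worker start a fresh trace for jobs that were enqueued before tracing was turned on.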

Checklist of changes

  • backend/src/config/tracing.js — new SDK initializer with multi-exporter support and graceful error capture
  • backend/src/utils/tracing.js — span/attribute helpers and context propagation primitives
  • backend/src/middleware/tracing.middleware.js — Express middleware for span enrichment + x-trace-id header
  • backend/src/server.js — boot tracing first, log status, shut SDK down on SIGTERM
  • backend/src/app.js — install tracing middleware
  • backend/src/worker/poller.js — manual spans for poll cycle / per-contract scope
  • backend/src/worker/processor.js — worker.action.execute span + cross-process context extraction
  • backend/src/worker/queue.js — inject trace context into BullMQ job payload
  • backend/package.json — adds OpenTelemetry SDK, API, exporters, semantic conventions
  • backend/.env.example — new tracing env vars
  • docker-compose.yml — optional Jaeger service under the observability profile
  • backend/__tests__/tracing.test.js — 18 unit tests covering the new module
  • docs/observability.md — operator-facing documentation
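
For readers unfamiliar with the middleware item in the checklist above, a minimal Express-style sketch of the x-trace-id behaviour follows. `createTracingMiddleware` is a hypothetical factory and `getTraceId` stands in for the PR's `getCurrentTraceId` helper; the span attribute enrichment is left out.

```javascript
// Hypothetical factory, not the PR's actual middleware.
// `getTraceId` should return the active trace id, or a falsy value outside any span.
function createTracingMiddleware(getTraceId) {
  return function tracingMiddleware(req, res, next) {
    const traceId = getTraceId();
    // Only expose the header when a trace is actually active.
    if (traceId) res.setHeader('x-trace-id', traceId);
    next();
  };
}

// Usage with a stubbed response object instead of a running server.
const headers = {};
const res = { setHeader: (name, value) => { headers[name] = value; } };
let nextCalled = false;
createTracingMiddleware(() => '0af7651916cd43dd8448eb211c80319c')({}, res, () => { nextCalled = true; });
console.log(headers['x-trace-id'], nextCalled); // 0af7651916cd43dd8448eb211c80319c true
```

Passing the trace-id lookup in as a function keeps the middleware testable without starting the SDK, in the spirit of the tests listed in this PR.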

Testing

Test suite (npm test --workspace=backend):

|       | Before this PR   | After this PR         |
|-------|------------------|-----------------------|
| Tests | 55               | 73                    |
| Pass  | 51               | 69                    |
| Fail  | 4 (pre-existing) | 4 (same pre-existing) |

The four failing tests (__tests__/queue.test.js, two in __tests__/trigger.controller.test.js, test/audit.test.js) all fail identically on main and are unrelated to this change. All 18 new tests pass.

New tests cover:

  • Default-disabled state (OTEL_ENABLED unset)
  • Case-insensitive flag parsing
  • Idempotent start() (returns same SDK on second call)
  • Init errors are captured, not thrown
  • withSpan returns the inner result, propagates errors, records exceptions
  • withSpan works with sync return values
  • setAttributes / getCurrentTraceId are safe no-ops outside any span
  • injectContextIntoCarrier preserves carrier identity
  • runWithExtractedContext runs the function with and without a carrier
  • End-to-end inject → extract round-trip
  • Tracing middleware sets x-trace-id and calls next()
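
The `withSpan` behaviours listed above (returns the inner result, propagates errors, records exceptions, handles sync values) can be sketched roughly as below. This is not the PR's implementation: the tracer is passed in explicitly so the example runs without the OpenTelemetry packages, and `createRecordingTracer` is a hypothetical stub. The span methods used (`recordException`, `setStatus`, `end`) and `tracer.startActiveSpan(name, fn)` mirror the `@opentelemetry/api` names.

```javascript
const SPAN_STATUS_ERROR = 2; // numeric value of SpanStatusCode.ERROR in @opentelemetry/api

// Sketch of a span wrapper; the real helper's signature may differ.
function withSpan(tracer, name, fn) {
  return tracer.startActiveSpan(name, (span) => {
    // Record the failure on the span, close it, then rethrow to the caller.
    const fail = (err) => {
      span.recordException(err);
      span.setStatus({ code: SPAN_STATUS_ERROR, message: err.message });
      span.end();
      throw err;
    };
    try {
      const result = fn(span);
      if (result && typeof result.then === 'function') {
        // Async path: end the span once the promise settles.
        return result.then((value) => { span.end(); return value; }, fail);
      }
      span.end(); // Sync path: end immediately and return the plain value.
      return result;
    } catch (err) {
      return fail(err);
    }
  });
}

// Hypothetical stub tracer that records span calls for inspection.
function createRecordingTracer() {
  const events = [];
  const tracer = {
    startActiveSpan(name, fn) {
      return fn({
        recordException: (err) => events.push(`exception:${err.message}`),
        setStatus: ({ code }) => events.push(`status:${code}`),
        end: () => events.push(`end:${name}`),
      });
    },
  };
  return { tracer, events };
}

const { tracer, events } = createRecordingTracer();
console.log(withSpan(tracer, 'demo.sync', () => 42)); // 42
console.log(events);                                  // [ 'end:demo.sync' ]
```

In this sketch a synchronous callback yields a plain value and an async one yields a promise, so callers keep whatever calling convention the wrapped function already had.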

Manual verification:

  • Booted the tracing module with OTEL_ENABLED=true, OTEL_EXPORTER=none, executed a withSpan call, confirmed a 32-char trace ID was generated and getCurrentTraceId() returned it from inside the span.
  • Verified tracing.shutdown() exits cleanly with no dangling network connections.
  • Confirmed OTEL_ENABLED=false (default) skips SDK initialization entirely (isInitialized() returns false).
  • Ran node --check against every modified file — all pass.

Notes

  • Tracing is opt-in by design — no behavioural change for current deployments. Operators turn it on by setting OTEL_ENABLED=true in .env.
  • The none exporter uses an in-memory span exporter so spans are still produced (and visible to test harnesses) but never sent over the network. This avoids NodeSDK's implicit fallback to a default OTLP exporter that would otherwise spew ECONNREFUSED to localhost:4318.
  • docs/ARCHITECTURE.md is referenced from the issue but does not yet exist in the repo; I documented this work in a new docs/observability.md instead. Happy to fold it into a future ARCHITECTURE.md when that lands.
  • Pre-existing test failures (queue, trigger controller, audit) are not addressed here — out of scope.

Follow-ups

  • Add OpenTelemetry metrics export (counters, histograms) to the same SDK once the tracing rollout is verified.
  • Consider trace-aware structured logging (inject traceId into every log line) once a structured logger replaces the current console.log shim in src/config/logger.js.

Wires the EventHorizon backend with OpenTelemetry so the request flow from
the API through BullMQ into the worker can be visualised as a single trace.

- Adds an opt-in tracing initializer in src/config/tracing.js that loads
  the SDK when OTEL_ENABLED=true and stays a complete no-op otherwise, so
  existing deployments are unaffected until they explicitly turn it on.
- Supports OTLP/HTTP (default), Jaeger, console, and in-memory exporters
  via OTEL_EXPORTER, with sensible localhost defaults.
- Adds withSpan / setAttributes helpers in src/utils/tracing.js and uses
  them to instrument the poller (stellar.poll.cycle / .contract) and the
  worker (worker.action.execute).
- Propagates trace context across the BullMQ queue via a _traceContext
  field, so a job's span continues the producer's trace.
- Annotates the active HTTP server span with route + user attributes and
  exposes the current trace id on the x-trace-id response header for log
  correlation.
- Adds a Jaeger profile to docker-compose.yml so devs can spin up a local
  collector with `docker compose --profile observability up jaeger`.
- Documents setup, configuration, and span list in docs/observability.md.
- Adds 18 unit tests covering the no-op path, env-driven configuration,
  span error recording, and context injection/extraction.

drips-wave Bot commented Apr 27, 2026

@menawar Great news! 🎉 Based on an automated assessment of this PR, the linked Wave issue(s) no longer count against your application limits.

You can now already apply to more issues while waiting for a review of this PR. Keep up the great work! 🚀

Development

Successfully merging this pull request may close these issues.

Backend: OpenTelemetry and Distributed Tracing Integration
