feat: add OpenTelemetry distributed tracing (#252) #360
Open
menawar wants to merge 1 commit into EDOHWARES:main
Wires the EventHorizon backend with OpenTelemetry so the request flow from the API through BullMQ into the worker can be visualised as a single trace.

- Adds an opt-in tracing initializer in src/config/tracing.js that loads the SDK when OTEL_ENABLED=true and stays a complete no-op otherwise, so existing deployments are unaffected until they explicitly turn it on.
- Supports OTLP/HTTP (default), Jaeger, console, and in-memory exporters via OTEL_EXPORTER, with sensible localhost defaults.
- Adds withSpan / setAttributes helpers in src/utils/tracing.js and uses them to instrument the poller (stellar.poll.cycle / .contract) and the worker (worker.action.execute).
- Propagates trace context across the BullMQ queue via a _traceContext field, so a job's span continues the producer's trace.
- Annotates the active HTTP server span with route + user attributes and exposes the current trace id on the x-trace-id response header for log correlation.
- Adds a Jaeger profile to docker-compose.yml so devs can spin up a local collector with `docker compose --profile observability up jaeger`.
- Documents setup, configuration, and span list in docs/observability.md.
- Adds 18 unit tests covering the no-op path, env-driven configuration, span error recording, and context injection/extraction.
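The span error recording described for `withSpan` can be illustrated with a minimal sketch. This is not the PR's code: the explicit `tracer` parameter and the `createRecordingTracer` stand-in are assumptions made so the example runs without the OpenTelemetry packages installed. The tracer only needs the `startActiveSpan(name, fn)` shape used by `@opentelemetry/api`.

```javascript
// Hypothetical sketch of a withSpan-style helper; not the PR's implementation.
// Numeric value of SpanStatusCode.ERROR in @opentelemetry/api.
const STATUS_ERROR = 2;

function withSpan(tracer, name, fn) {
  return tracer.startActiveSpan(name, (span) => {
    // Record the failure on the span, close it, then rethrow to the caller.
    const fail = (err) => {
      span.recordException(err);
      span.setStatus({
        code: STATUS_ERROR,
        message: err instanceof Error ? err.message : String(err),
      });
      span.end();
      throw err;
    };
    let result;
    try {
      result = fn(span);
    } catch (err) {
      return fail(err);
    }
    if (result && typeof result.then === 'function') {
      // Async work: end the span only once the promise settles.
      return result.then((value) => {
        span.end();
        return value;
      }, fail);
    }
    span.end(); // Sync work: the span covers just the call itself.
    return result;
  });
}

// Stand-in tracer (assumption, for demonstration only) that records spans
// in memory instead of exporting them.
function createRecordingTracer() {
  const spans = [];
  return {
    spans,
    startActiveSpan(name, fn) {
      const span = {
        name,
        ended: false,
        exceptions: [],
        status: null,
        recordException(err) { this.exceptions.push(err); },
        setStatus(status) { this.status = status; },
        end() { this.ended = true; },
      };
      spans.push(span);
      return fn(span);
    },
  };
}

const demoTracer = createRecordingTracer();
const syncResult = withSpan(demoTracer, 'worker.action.execute', () => 42);
console.log(syncResult, demoTracer.spans[0].ended);
```

Handling the sync and async paths separately keeps the helper usable around plain functions as well as promise-returning ones, and in both cases the span is ended exactly once.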
@menawar Great news! 🎉 Based on an automated assessment of this PR, the linked Wave issue(s) no longer count against your application limits. You can now apply to more issues while waiting for a review of this PR. Keep up the great work! 🚀
Summary
Closes #252.
This PR introduces opt-in OpenTelemetry distributed tracing across the EventHorizon backend so the request flow from API → BullMQ queue → worker can be observed as a single trace in Jaeger, Zipkin, or any OTLP-compatible backend.
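To make the "single trace" idea concrete, here is a rough sketch of how a producer's trace identity could travel inside a job payload and be picked up by the worker. It uses the W3C `traceparent` format (`version-traceId-parentSpanId-flags`) and only the Node standard library. The function names and the exact shape stored under `_traceContext` are assumptions for illustration; the PR itself delegates this to OpenTelemetry's propagation API.

```javascript
const { randomBytes } = require('node:crypto');

// W3C Trace Context "traceparent": 00-<32 hex>-<16 hex>-<2 hex flags>.
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Producer side (hypothetical): attach the active span's identity to the job.
function injectTraceContext(jobData, activeSpan) {
  const traceparent = `00-${activeSpan.traceId}-${activeSpan.spanId}-01`;
  return { ...jobData, _traceContext: { traceparent } };
}

// Worker side (hypothetical): recover the parent identity, or null when the
// job carries no usable context.
function extractTraceContext(jobData) {
  const carrier = jobData && jobData._traceContext;
  const header = carrier && carrier.traceparent;
  const match = typeof header === 'string' ? TRACEPARENT_RE.exec(header) : null;
  if (!match) return null;
  return {
    traceId: match[1],
    parentSpanId: match[2],
    sampled: (parseInt(match[3], 16) & 1) === 1,
  };
}

// Worker side: continue the producer's trace when possible, otherwise start
// a fresh trace with a new random id.
function startWorkerSpan(jobData) {
  const parent = extractTraceContext(jobData);
  return {
    traceId: parent ? parent.traceId : randomBytes(16).toString('hex'),
    spanId: randomBytes(8).toString('hex'),
    parentSpanId: parent ? parent.parentSpanId : null,
  };
}

// Simulated API span -> enqueue -> worker.
const apiSpan = {
  traceId: randomBytes(16).toString('hex'),
  spanId: randomBytes(8).toString('hex'),
};
const job = injectTraceContext({ action: 'notify' }, apiSpan);
const workerSpan = startWorkerSpan(job);
console.log(workerSpan.traceId === apiSpan.traceId); // true
```

Because the worker reuses the trace ID and records the producer's span as its parent, a backend such as Jaeger can stitch both sides into one trace even though they run in separate processes.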
The feature is gated behind `OTEL_ENABLED=true`. With the flag off (the default), the SDK never starts and the helpers fall back to OpenTelemetry's built-in no-op tracer, so existing deployments see no behavioural or performance change until they explicitly opt in.

Approach
- `backend/src/config/tracing.js`: single SDK initialization point. Must be required before any instrumented library, so it's the first import in `server.js`. Supports `otlp` (default), `jaeger`, `console`, and `none` exporters via `OTEL_EXPORTER`.
- `backend/src/utils/tracing.js`: thin wrappers (`withSpan`, `setAttributes`, `getCurrentTraceId`, `injectContextIntoCarrier`, `runWithExtractedContext`) that work whether or not the SDK has been started.
- `backend/src/middleware/tracing.middleware.js`: annotates the active HTTP span with route/user attributes and sets an `x-trace-id` response header for log correlation.
- `backend/src/worker/poller.js`: wraps each polling cycle and per-contract loop in spans (`stellar.poll.cycle`, `stellar.poll.contract`).
- `backend/src/worker/processor.js`: wraps job execution in `worker.action.execute`, extracting trace context from the job payload so the worker's span continues the producer's trace.
- `backend/src/worker/queue.js`: injects the active trace context into job data on enqueue.
- `docker-compose.yml`: adds an optional `jaeger` service behind an `observability` profile (`docker compose --profile observability up jaeger`).
- `backend/.env.example`: documents the new env vars.
- `docs/observability.md`: full setup, configuration reference, and span catalog.

Checklist of changes
- `backend/src/config/tracing.js`: new SDK initializer with multi-exporter support and graceful error capture
- `backend/src/utils/tracing.js`: span/attribute helpers and context propagation primitives
- `backend/src/middleware/tracing.middleware.js`: Express middleware for span enrichment + `x-trace-id` header
- `backend/src/server.js`: boot tracing first, log status, shut SDK down on `SIGTERM`
- `backend/src/app.js`: install tracing middleware
- `backend/src/worker/poller.js`: manual spans for poll cycle / per-contract scope
- `backend/src/worker/processor.js`: `worker.action.execute` span + cross-process context extraction
- `backend/src/worker/queue.js`: inject trace context into BullMQ job payload
- `backend/package.json`: adds OpenTelemetry SDK, API, exporters, semantic conventions
- `backend/.env.example`: new tracing env vars
- `docker-compose.yml`: optional Jaeger service under the `observability` profile
- `backend/__tests__/tracing.test.js`: 18 unit tests covering the new module
- `docs/observability.md`: operator-facing documentation

Testing
Test suite: `npm test --workspace=backend`.

The four failing tests (`__tests__/queue.test.js`, two in `__tests__/trigger.controller.test.js`, `test/audit.test.js`) all fail identically on `main` and are unrelated to this change. All 18 new tests pass.

New tests cover:
- No-op path (`OTEL_ENABLED` unset)
- `start()` (returns same SDK on second call)
- `withSpan` returns the inner result, propagates errors, records exceptions
- `withSpan` works with sync return values
- `setAttributes` / `getCurrentTraceId` are safe no-ops outside any span
- `injectContextIntoCarrier` preserves carrier identity
- `runWithExtractedContext` runs the function with and without a carrier
- Middleware sets `x-trace-id` and calls `next()`

Manual verification:
- With `OTEL_ENABLED=true` and `OTEL_EXPORTER=none`, executed a `withSpan` call and confirmed that a 32-char trace ID was generated and that `getCurrentTraceId()` returned it from inside the span.
- `tracing.shutdown()` exits cleanly with no dangling network connections.
- `OTEL_ENABLED=false` (default) skips SDK initialization entirely (`isInitialized()` returns `false`).
- `node --check` against every modified file: all pass.

Notes
- Tracing is enabled by setting `OTEL_ENABLED=true` in `.env`.
- The `none` exporter uses an in-memory span exporter so spans are still produced (and visible to test harnesses) but never sent over the network. This avoids `NodeSDK`'s implicit fallback to a default OTLP exporter that would otherwise spew `ECONNREFUSED` to `localhost:4318`.
- `docs/ARCHITECTURE.md` is referenced from the issue but does not yet exist in the repo; I documented this work in a new `docs/observability.md` instead. Happy to fold it into a future ARCHITECTURE.md when that lands.

Follow-ups
- Inject `traceId` into every log line once a structured logger replaces the current `console.log` shim in `src/config/logger.js`.
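As a rough idea of what that follow-up could look like, the sketch below wraps a log sink so every entry carries the current trace ID. Everything here is hypothetical: `createTraceAwareLogger` is not part of this PR, and `getTraceId` stands in for a helper such as `getCurrentTraceId`, assumed to return a falsy value when no span is active.

```javascript
// Hypothetical trace-aware logger; not part of this PR.
// `getTraceId` is assumed to return a trace id string, or a falsy value
// when called outside any span. `sink` receives one JSON string per entry.
function createTraceAwareLogger(getTraceId, sink = console.log) {
  const write = (level, message, fields = {}) => {
    const entry = { level, message, ...fields };
    const traceId = getTraceId();
    if (traceId) entry.traceId = traceId; // omit the field outside any span
    sink(JSON.stringify(entry));
  };
  return {
    info: (message, fields) => write('info', message, fields),
    error: (message, fields) => write('error', message, fields),
  };
}

// Usage with a fixed trace id and an in-memory sink, for demonstration.
const logLines = [];
const logger = createTraceAwareLogger(
  () => '0af7651916cd43dd8448eb211c80319c',
  (line) => logLines.push(line),
);
logger.info('job enqueued', { jobId: 7 });
console.log(logLines[0]);
```

Emitting the trace ID as a structured field lets a log line be matched against the `x-trace-id` response header without parsing free text.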