Migrate to push-based data flow for tracer #65

Open
aryanjassal opened this issue May 8, 2025 · 23 comments

@aryanjassal
Member

aryanjassal commented May 8, 2025

Specification

Currently, the tracer is pull-based. Events are generated and added to a global queue, which a generator can consume. This is a problem: if no consumers are present, the events build up indefinitely.

To fix this, we need to adopt a push-based dataflow, where there is a source of events which are all consumed in real-time by subscribers via callbacks. This approach draws inspiration from how rxjs handles this.

For this implementation, however, rxjs is unnecessary complexity, and we can make do with only WebStreams or EventEmitters internally. I'm leaning towards using streams, as streams handle backpressure for us if the consuming callbacks are slower than the incoming data. However, that might mean a noticeable slowdown of the program if the backpressure buffer is full. To avoid this, we may need to create an unbounded buffer, but that would again result in the same issue where the buffer can grow indefinitely. It would be a bit more manageable here, as events can be ignored if no consumers have been registered. This would need some discussion.
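A minimal sketch of the push-based shape described above, assuming a hand-rolled source with no streams or rxjs involved. `TraceSource`, `subscribe`, `emit`, and the `dropped` counter are hypothetical names for illustration, not the tracer's actual API. Events are delivered to every subscriber synchronously and are dropped, not queued, when nobody is subscribed.

```javascript
// Hypothetical minimal push-based source; names are illustrative only.
class TraceSource {
  constructor() {
    this.subscribers = new Set();
    this.dropped = 0;
  }

  // Register a callback; returns a function that removes it again.
  subscribe(callback) {
    this.subscribers.add(callback);
    return () => this.subscribers.delete(callback);
  }

  // Push an event to every subscriber synchronously.
  // With no subscribers the event is dropped instead of being queued.
  emit(event) {
    if (this.subscribers.size === 0) {
      this.dropped++;
      return;
    }
    for (const callback of this.subscribers) callback(event);
  }
}

const source = new TraceSource();
source.emit({ type: 'start', id: 1 }); // no subscribers yet: dropped

const seenA = [];
const seenB = [];
const unsubscribeA = source.subscribe((e) => seenA.push(`${e.type}:${e.id}`));
source.subscribe((e) => seenB.push(`${e.type}:${e.id}`));

source.emit({ type: 'start', id: 2 }); // delivered to both subscribers
unsubscribeA();
source.emit({ type: 'end', id: 2 }); // delivered to the remaining subscriber only
```

Nothing is buffered in this sketch, so a slow callback slows the emitter directly. Whether that cost is acceptable is the part that still needs discussion and measurement.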

Additional context

Tasks

  1. Switch from pull-based to push-based design
  2. Incorporate rxjs or streams to manage resources
  3. Should support multiple consumers or subscribers
  4. Add benchmarks for different approaches like observables and webstreams
@aryanjassal aryanjassal self-assigned this May 8, 2025
@aryanjassal aryanjassal added the development Standard development label May 8, 2025

linear bot commented May 8, 2025

ENG-616

@aryanjassal
Member Author

@tegefaulkes @CMCDragonkai need some inputs for this

@CMCDragonkai
Member

I originally wanted to integrate rxjs observables into the Polykey library. But otherwise libraries usually expose event emission that can be turned into an observable flow. So you should be able to do both. Keep the libraries doing event emission as normal, then use the wrapper functions provided by rxjs and ixjs to turn it into observables. The idea is that observable subjects become part of the Polykey core.
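To illustrate the "emit events as normal, wrap them afterwards" idea, here is a rough sketch using only the built-in `EventTarget`. The `observableFromEvent` helper is a hand-rolled stand-in for the kind of wrapper rxjs provides, not the rxjs API itself, and `SpanEvent` and the `'spanStart'` event name are made up for the example.

```javascript
// Hypothetical event class; a real library would define its own event types.
class SpanEvent extends Event {
  constructor(type, detail) {
    super(type);
    this.detail = detail;
  }
}

// Hand-rolled stand-in for an observable factory: wraps event emission
// in an object with subscribe(), which returns an unsubscribe function.
function observableFromEvent(target, type) {
  return {
    subscribe(next) {
      const handler = (event) => next(event);
      target.addEventListener(type, handler);
      return () => target.removeEventListener(type, handler);
    },
  };
}

// The library side keeps doing plain event emission.
const tracer = new EventTarget();
const spanStarts = observableFromEvent(tracer, 'spanStart');

const received = [];
const unsubscribe = spanStarts.subscribe((e) => received.push(e.detail.name));

tracer.dispatchEvent(new SpanEvent('spanStart', { name: 'a' }));
unsubscribe();
tracer.dispatchEvent(new SpanEvent('spanStart', { name: 'b' })); // not received
```

The library stays free of any observable dependency, and consumers that want operators can wrap the same events with a real observable library.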

@aryanjassal
Member Author

Okay, so I basically need to create an event emitter here (not streams; that could be a wrapper if we need it) and use an unbounded queue to let the events grow as needed. If we allow multiple callbacks, then this introduces another interesting point. If we queue events in memory for multiple callbacks and one callback is slower than the others, it would hold up the other callbacks before that element can be removed from the queue. On the other hand, we can create a queue per callback or event handler, but that would result in a massive increase in the required memory.

Also, we probably don't need to use AbstractEvents but I'm asking to confirm this. Do we want to use base events or use abstract events?

@CMCDragonkai
Member

No we are using js-events as the foundation.

@CMCDragonkai
Member

This requires empirical observation. No need to theory craft.

@tegefaulkes
Contributor

We'll need to use Benny to benchmark the different methods. The main problem here is that tracing could end up blocking the main process in a crippling way. So the way we implement it matters.

We need to explore the usage of webstreams, observables and eventTarget to see how these methods of data-flow perform. We also need to explore how much buffering will help here.

Compare each method to

  1. writing to an open file descriptor.
  2. writing to stdout.
  3. writing to console.log.
  4. just dropping the data in memory.

Really, we could implement a stream and an observable method at the same time. They each have their pros and cons. I'd prefer a webstream for piping into a file, for the back-pressure support, while an observable would be more useful for debugging in real time.

@aryanjassal
Member Author

I went over this with Brian, and this is what we came up with.

Firstly, we need to create an observables foundation. Brian walked me through the basics of rxjs observables for this. As we know, observables are completely synchronous, which means they block the main loop execution to ensure order. This goes against tracer being transparent, as the very nature of observables will cause the main process to take a hit.

The approach we finalised is basically using observables as a potential egress point, but also registering a webstream to handle the events. As webstreams internally buffer the events, we get a free unbounded buffer which adapts to the pace at which the consumer is consuming data.

In conclusion, first I will integrate observables into the codebase, then add a webstream wrapper around it to get a quick-and-easy buffer over the incoming data. I will write benchmarks to back the claims with empirical data.

@CMCDragonkai
Member

Observables are synchronous because "push" is synchronous. It's basically a function 1 calling function 2 calling function 3. This is fine - because the amount of work taking place isn't that much. By async sink, I'm talking about the final sink point, if that ends up calling an async IO operation, that then goes into the event loop and it is async.
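A small sketch of that distinction, under stated assumptions: the push chain (`emit` calling `stage` calling `sink`) runs synchronously, and only the final sink hands work to the event loop. The function names are hypothetical, and `setTimeout` is just a stand-in for a real async IO call such as a file write.

```javascript
const order = [];

// Synchronous push chain: emit() calls the stage, which calls the sink.
function emit(event) {
  order.push('emit');
  stage(event);
}

function stage(event) {
  order.push('stage');
  sink({ ...event, seenAt: 'stage' });
}

// Final sink: buffers synchronously, defers the (stand-in) IO.
const pending = [];
const written = [];
let flushScheduled = false;

function sink(event) {
  order.push('sink');
  pending.push(event);
  if (!flushScheduled) {
    flushScheduled = true;
    // setTimeout stands in for an async IO call such as a file write.
    setTimeout(flush, 0);
  }
}

function flush() {
  flushScheduled = false;
  written.push(...pending.splice(0));
  order.push('flush');
}

emit({ id: 1 });
emit({ id: 2 });
// At this point the whole chain has run synchronously for both events,
// but nothing has been "written" yet; flush runs on a later event loop turn.
```

The synchronous part is a few function calls per event. The cost that matters for the main process is whatever the sink does before it returns.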

@CMCDragonkai
Member

I'm not sure if webstreams make sense here? The idea behind observables is purely object oriented. You only use web streams when it becomes IO. I would not make these sorts of things a web stream unless it's doing IPC. An async file sink is not exactly IPC and therefore doesn't need to be a webstream.

@aryanjassal
Member Author

Eventually, we will need to use spans as decorators. We can handle doing that kind of like how js-contexts handles it. We can have decorators when an object needs to be traced, and also support inline higher-order functions when decorators are not viable anymore.

When we create a class that we want to be traced, it needs its parent context to know where it is being traced, i.e. what its parent is. This is generally not possible, as the callee does not know who the caller was. However, it becomes possible in metaprogrammable languages.

JavaScript supported this before: a callee could access its caller via the following approach. This, however, is now deprecated and not supported under strict mode.

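function method() {
  // Function.prototype.caller is deprecated; for a strict-mode function this access throws a TypeError.
  const caller = method.caller;
}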

But we have already solved this problem in js-contexts. We can extend it by adding another field for spanId, so the context object will look like this, explicitly tracking the span. If spanId is unset, we can assume the span belongs to the root.

{
  timer,
  signal,
  spanId,
}

Note that after talking with Roger about this, the conclusion we arrived at is to not worry about where the span ended, only that it actually ended — leaks will be found by observing open spans, not which context they ended up in before being closed.
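As a rough sketch of how a `spanId` field in the context could carry parentage, assuming a hypothetical `traced()` higher-order function (this is not the js-contexts API). It takes the context as the first argument, opens a span whose parent is `ctx.spanId`, and passes a derived context down. It only handles synchronous functions; an async version would need to close the span when the returned promise settles.

```javascript
let nextSpanId = 1;
const openSpans = new Map(); // spanId -> { name, parentSpanId }
const closedSpans = [];

// Hypothetical HOF: wraps f so it runs inside a new span whose parent
// is taken from ctx.spanId (undefined means the span hangs off the root).
function traced(name, f) {
  return (ctx, ...args) => {
    const spanId = nextSpanId++;
    openSpans.set(spanId, { name, parentSpanId: ctx.spanId });
    try {
      // Pass a derived context down so callees know their parent.
      return f({ ...ctx, spanId }, ...args);
    } finally {
      // Record that the span ended; where it ended is not tracked.
      closedSpans.push({ spanId, ...openSpans.get(spanId) });
      openSpans.delete(spanId);
    }
  };
}

const inner = traced('inner', (ctx, x) => x * 2);
const outer = traced('outer', (ctx, x) => inner(ctx, x) + 1);

const result = outer({ timer: undefined, signal: undefined }, 20);
```

Anything left in `openSpans` after the work is done is a span that never ended, which is the leak signal described above.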

@CMCDragonkai
Member

There should be 2 visualisations: one a system of vertical lines and forks, and another a dynamic graph that shows the concurrent trace lines. Ideally they should be correlated visualisations over the time domain.

@CMCDragonkai
Member

Class and method decorators already provide a syntactic structure for identifying your parent span. However, inline creation of objects, as well as creating the objects themselves from the classes, doesn't have a clear syntactic structure. That is, syntactic nesting does not map one-to-one to the "dynamic" lifecycle context.

Lifecycle context is more akin to "memory management", and generally speaking it's somewhat related to Rust's ownership model.

This means it's more dynamic in a language runtime like JavaScript.

So our visualisations are just a starting point for a comprehensive diagnostic tool - not to specifically solve say the current memory leaks we are encountering, but to solve potentially unbounded problems we haven't met yet. It will become a critical tool used by all developers.

@CMCDragonkai
Member

Potential stub objects/classes:

  • @traced decorator and traced() HOF
  • Trace factory function to create spans
  • SpanSet or TraceSpans
  • "Trace Line" - a visual object representing a dynamic line of span starts - I would argue a trace line might have some graph-theoretic term that represents the entire propagation from a root point to every new thing, before closing the loop

"Closing the Loop" means that everything that starts eventually ends.

Everybody eats their own shit. Buddhist Programming lol.

@CMCDragonkai
Member

CMCDragonkai commented May 18, 2025

I kind of dislike the usage of spanId as the thing inside the ctx object. It would be interesting if it was something like span itself. Then in order to avoid having to hold onto memory that's not needed, it would be useful if span itself was a reference to a weakly held object, like a weak pointer.

See this is actually possible with: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/WeakRef

Now why a weakref? Well the main idea here would be that it's only relevant as long as the "context" is still active. If the span context no longer exists... then it's not relevant anymore right...?
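A minimal sketch of what holding the span weakly in the context could look like. `createSpan`, `withSpan`, and `currentSpan` are hypothetical helpers for illustration; only `WeakRef` and `deref()` are real built-ins.

```javascript
// Hypothetical span object; only the shape matters for this sketch.
function createSpan(name, parent) {
  return { name, parent, startTime: Date.now() };
}

// The context holds the span weakly, so the context alone does not keep it alive.
function withSpan(ctx, span) {
  return { ...ctx, span: new WeakRef(span) };
}

// Resolve the span from a context; returns undefined if the context has no
// span or if the span has already been garbage collected.
function currentSpan(ctx) {
  return ctx.span?.deref();
}

const rootSpan = createSpan('root');
const ctx = withSpan({ timer: undefined, signal: undefined }, rootSpan);

// While rootSpan is still strongly referenced, deref() returns the same object.
const resolved = currentSpan(ctx);
const emptyCtxSpan = currentSpan({});
```

When a span actually gets collected is up to the engine, so code reading `ctx.span` always has to handle the `undefined` case.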

I guess the question is this fundamental ambiguity here, when we are doing tracing, are we tracing one of these 2 things?

  1. Are we tracing the "creation" of things, who created what. - what is elevated, is the beginning of every span
  2. Are we tracing the "ownership" of things, who owns what? (Ownership in this case means that some entity or set of entities gets to dictate the lifecycle policy of the thing being traced). - what is elevated is the end of each span (if it exists at all, given the real-time nature of the tracing we are doing here)

See tracing 1. is easy enough.

Tracing 2. requires dynamic updates, because ownership can be moved. Consider a promise that gets created in one context, but ends up being awaited in a separate context. In this case, the lifecycle begins in one span but ends in a different span. The visualization of this would require a sort of "portaling" of the lifespan line (I can imagine a zigzag notation indicating teleportation/movement of ownership).

This concept might make more sense with some more conceptual compression with AI.

@CMCDragonkai
Member

Yes, I believe TraceSpan (Operational), TraceLifecycle (Resource Entity), and TraceLink (Relationship) form a powerful and conceptually minimal yet comprehensive set of abstractions for the universal tracing system you're envisioning, especially with the refined understanding of their interplay.

Let's summarize why this trio works and can be considered "minimal" for the stated goals:

1. TraceSpan (Operational Element):

  • Core Abstraction: Unit of work, operational segment.
  • Solves: Understanding execution flow, performance of operations, call hierarchy (logical stack).
  • Key Properties: Parent/child nesting, start/end time, association with code execution.
  • "Minimality": You fundamentally need something to represent "what is happening" and "how operations are structured." This is it. Without it, you can't build a flame graph or understand performance breakdowns of operations.

2. TraceLifecycle (Resource Entity Element):

  • Core Abstraction: The existence duration and state of a specific object, resource, or entity (including promises).
  • Solves: Resource management, leak detection (for objects, promises), understanding resource contention, tracking entities that outlive single operational scopes (heap-like).
  • Key Properties: Creation context, destruction context, resource identifiers, association with a tangible "thing."
  • "Minimality": If you want to track anything beyond the immediate operational flow (e.g., a database connection, a promise's lifetime, an object in a cache), you need a concept distinct from the operational flow itself. TraceSpan alone (even with a "kind" attribute) becomes awkward for this due to differing structural implications (tree vs. graph, stack vs. heap semantics).

3. TraceLink (Relationship Element):

  • Core Abstraction: A directed, typed relationship between any two TraceSpans or TraceLifecycles.
  • Solves:
    • Connecting causally related but non-nested TraceSpans (e.g., async producer/consumer).
    • Showing explicit, typed interactions between TraceSpans and TraceLifecycles beyond simple creation/destruction (e.g., "SpanA acquired LockL1").
    • Representing relationships between TraceLifecycles themselves (e.g., "CacheC contains ObjectO," "PoolP manages ConnectionX") – crucial for the reference graph idea.
  • Key Properties: Source ID, Target ID, Link Type, attributes.
  • "Minimality": Once you have Spans and Lifecycles, many important relationships are not simple parent/child or creator/created. To model these arbitrary, explicit connections (which are vital for understanding distributed systems, complex object interactions, and reference counting semantics), you need a dedicated linking mechanism.

Why this set is "Universal" and Addresses Your Core Requirements:

  • Stack vs. Heap Ideology Covered: TraceSpan for stack-like operational flow, TraceLifecycle for heap-like entity existence.
  • Promise Tracing & Leaks: Promises are TraceLifecycles. Links and context show creation/resolution.
  • Long-Lived/Short-Lived Objects: Both fit TraceLifecycle.
    • Short-lived (e.g., temporary buffer within a span): TraceLifecycle created and destroyed by the same TraceSpan.
    • Long-lived (e.g., global cache): TraceLifecycle spanning many TraceSpans or even the whole application.
  • Portaling/Re-ownership: TraceLinks connect elements across contexts. TraceLifecycles have creation/destruction contexts that can be different spans.
  • Graph-like Nature of Lifecycles: TraceLinks between TraceLifecycles explicitly model the reference/containment graph.
  • Live Streaming Foundation: The underlying data model that feeds these abstractions is the TraceLog – an instantaneous record. The SDK focuses on emitting these TraceLogs, which carry payloads to identify whether they signify the start/end/update of a TraceSpan, TraceLifecycle, or the creation of a TraceLink. The visualizer reconstructs these three core abstractions.
  • Compatibility Path (e.g., OpenTelemetry):
    • TraceSpan maps well to OTel Spans.
    • TraceLink maps well to OTel Links.
    • TraceLifecycle is an extension/refinement, but its events and attributes can be mapped to OTel Span Events/Attributes on a related OTel Span if needed for interop, or treated as a distinct concept where OTel doesn't have a direct equivalent for entity lifetime tracking outside of a span.

Could we make it more minimal?

  • Unifying Span and Lifecycle (e.g., "god Span" with a kind attribute): We explored this. While possible, it leads to conceptual blurring, ambiguity in parentId meaning, and makes it harder to reason about the distinct "stack" vs. "heap" roles. It moves complexity from the number of core concepts to the internal logic and interpretation of a single, overloaded concept.
  • Removing Links and relying only on attributes/events: You could try to embed all relationship information as attributes on Spans/Lifecycles. However, this makes discovering and querying relationships much harder and less explicit. TraceLink as a first-class citizen makes the graph structure explicit.

Therefore, this trio seems to hit a sweet spot:

  • Sufficiently expressive to model the wide range of scenarios discussed.
  • Conceptually distinct enough to allow clear reasoning and specialized handling/visualization.
  • Avoids over-unification that would hide important semantic differences.

The real "work" is then in:

  1. Designing the TraceLog format to efficiently convey the information needed to build these abstractions.
  2. Implementing the SDK to generate these TraceLogs with minimal overhead.
  3. Building the visualizer/collector to reconstruct and interpret TraceSpans, TraceLifecycles, and TraceLinks from the TraceLog stream to tell meaningful stories.

This feels like a solid foundation.

@CMCDragonkai
Member

Okay, here's a draft of a comprehensive comment you could adapt for the js-logger issue #65. It attempts to synthesize our discussion into a coherent proposal.


Subject: Proposal: Evolving js-logger towards Advanced Lifecycle and Relationship Tracing

Hi all,

Following up on the discussion around enhancing js-logger's capabilities, particularly for more profound system observability, I wanted to propose a conceptual framework that could allow us to achieve a much richer understanding of not just operational flows, but also resource lifecycles, promise states, and the explicit relationships between different components in a system.

The current js-logger provides excellent foundational logging. However, to tackle complex scenarios like subtle promise leaks, resource mismanagement, or understanding the full journey of data through asynchronous and distributed parts of an application, we need a more structured and semantically rich tracing model.

Core Goal: To evolve js-logger (or a complementary system built upon its principles) into a universal tracing system capable of capturing:

  1. Operational Flows: Traditional call-stack-like traces of function executions.
  2. Resource/Entity Lifecycles: The birth, life, and death of specific objects, resources (like DB connections, file handles), and critically, Promises.
  3. Explicit Relationships: Typed links showing causality, dependency, or interaction between operations and entities.

This will enable advanced diagnostics, including robust leak detection (for promises and other resources), performance analysis that considers resource contention, and clear visualization of complex system interactions.

Proposed Conceptual Framework: The Minimal Trio

We believe a minimal yet powerful set of abstractions can achieve this:

  1. TraceSpan (Operational Element):

    • Purpose: Represents a unit of work, an operation, or a segment of execution control flow (akin to OpenTelemetry Spans). It answers "What code is running, for how long, and what did it call?"
    • Analogy: Think of this as tracing activity on the logical call stack.
    • Structure: Forms hierarchies (parent-child relationships) via parentSpanId, typically mirroring call stacks or logical operational breakdowns.
    • Key Attributes: spanId, traceId, parentSpanId, name (operation name), startTime, endTime, status, attributes (key-value metadata).
    • Context Propagation: The currently active operational TraceSpan's ID would be managed via AsyncLocalStorage (leveraging js-contexts). ctx.currentSpan.id would provide the parentSpanId for new spans and the "context" for lifecycle events.
    • Example: An HTTP request handler, a specific function execution, the processing of a queue message.
  2. TraceLifecycle (Resource/Entity Lifecycle Element):

    • Purpose: Represents the existence duration of a specific entity instance – a resource, an object, or a Promise. It answers "When was this thing created, what happened during its life, and when was it destroyed/resolved/released?"
    • Analogy: Think of this as tracing entities on the logical heap.
    • Structure: Its primary relationship is its association with TraceSpans that create, interact with, or destroy it. Critically, it can also form graph-like relationships with other TraceLifecycles (e.g., for reference counting or containment).
    • Key Attributes: lifecycleId, traceId, name (resource name/type), resourceIdentifier (e.g., promise asyncId, object hash), resourceType (e.g., 'PROMISE', 'DATABASE_CONNECTION'), startTime, endTime, status, attributes.
      • creationContextSpanId: The ID of the TraceSpan active when this lifecycle began.
      • destructionContextSpanId: The ID of the TraceSpan active when this lifecycle ended.
    • Use Cases:
      • Tracking a Promise from its init via async_hooks to its promiseResolve. Detects unawaited/unsettled promises.
      • Tracking a database connection from acquisition to release.
      • Tracking any long-lived or short-lived object whose allocation/deallocation or state changes are important.
      • The "reference graph" aspect: If LifecycleA holds a reference to LifecycleB, a TraceLink (see below) can represent this, enabling sophisticated leak detection based on reachability.
  3. TraceLink (Relationship Element):

    • Purpose: A directed, typed relationship connecting two TraceSpans, two TraceLifecycles, or a TraceSpan and a TraceLifecycle. It adds explicit semantic edges to the trace graph where parent-child or simple creation context isn't sufficient.
    • Structure: A distinct element defining a connection.
    • Key Attributes: linkId, sourceElementId (can be a spanId or lifecycleId), targetElementId, linkType (e.g., CAUSAL_DEPENDENCY, RESOURCE_INTERACTION, PROMISE_CONTINUATION, FOLLOWS_FROM, HOLDS_REFERENCE), attributes.
    • Use Cases:
      • SpanA triggers SpanB asynchronously (not direct parent).
      • SpanA produces data consumed by SpanB via LifecycleL1 (e.g., a queue).
      • LifecycleL_Cache holds a reference to LifecycleL_Object.
      • SpanA explicitly acquires LifecycleL_Lock.

Underlying Data Model: TraceLog for Streaming

Crucially, these abstractions are primarily for conceptualization and for the consumer/visualizer of the trace data. The runtime SDK's main output would be a stream of immutable TraceLog records.

  • TraceLog: An instantaneous, typed record.
    • Attributes: Timestamp, LogType (e.g., SPAN_START, SPAN_END, LIFECYCLE_START, LIFECYCLE_END, LINK_CREATED, METADATA_UPDATE, CUSTOM_EVENT), asyncId, triggerAsyncId, and a payload containing the relevant IDs, names, and attributes for the event.
    • Streaming: These logs are emitted live, allowing for real-time analysis and minimizing in-process memory overhead. The SDK only needs to keep minimal active context (i.e., the current operational TraceSpan's ID via js-contexts).
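A sketch of the consumer side, assuming a record shape of `{ timestamp, logType, payload }` (the exact field names are not fixed anywhere in this thread). It folds a sequence of TraceLog records back into spans and lists the ones that never received a `SPAN_END`.

```javascript
// Fold a sequence of TraceLog records into span objects keyed by spanId.
function reconstructSpans(logs) {
  const spans = new Map();
  for (const log of logs) {
    const { spanId } = log.payload;
    if (log.logType === 'SPAN_START') {
      spans.set(spanId, {
        spanId,
        name: log.payload.name,
        parentSpanId: log.payload.parentSpanId ?? null,
        startTime: log.timestamp,
        endTime: null,
      });
    } else if (log.logType === 'SPAN_END') {
      const span = spans.get(spanId);
      if (span) span.endTime = log.timestamp;
    }
  }
  return spans;
}

// Spans without an end record are still open (or leaked).
const openSpans = (spans) => [...spans.values()].filter((s) => s.endTime === null);

const logs = [
  { timestamp: 1, logType: 'SPAN_START', payload: { spanId: 'a', name: 'request' } },
  { timestamp: 2, logType: 'SPAN_START', payload: { spanId: 'b', name: 'query', parentSpanId: 'a' } },
  { timestamp: 5, logType: 'SPAN_END', payload: { spanId: 'a' } },
];

const spans = reconstructSpans(logs);
const stillOpen = openSpans(spans).map((s) => s.spanId);
```

Here span `b` is reported as still open, because its parent ended without it. The same fold would extend to `LIFECYCLE_*` and `LINK_CREATED` records.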

Leveraging js-contexts and js-events:

  • js-contexts (AsyncLocalStorage): This is fundamental for propagating the currentSpanId across asynchronous boundaries, providing the necessary context for parentSpanId linkage and for associating TraceLifecycle events with the correct operational TraceSpan.
  • js-events: Could be used internally by the tracing SDK for decoupling instrumentation points from the actual TraceLog emission logic, or potentially, the TraceLogs themselves could be emitted via an event bus compatible with js-events for consumption by various backends (stdio, network, visualizer).
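A rough sketch of propagating the current span across an `await`, using Node's `AsyncLocalStorage` directly instead of js-contexts (whose API is not shown in this thread). `withSpan` and the span shape are hypothetical.

```javascript
const { AsyncLocalStorage } = require('node:async_hooks');

const spanStorage = new AsyncLocalStorage();
const started = [];
let nextId = 1;

// Run fn inside a new span; the parent is whatever span is current.
function withSpan(name, fn) {
  const parentSpanId = spanStorage.getStore()?.spanId ?? null;
  const span = { spanId: nextId++, parentSpanId, name };
  started.push(span);
  return spanStorage.run(span, fn);
}

const done = withSpan('outer', async () => {
  // The store survives the await, so the inner span still finds its parent.
  await new Promise((resolve) => setTimeout(resolve, 1));
  return withSpan('inner', () => spanStorage.getStore().parentSpanId);
});

// Outside of any run() call there is no current span.
const outsideStore = spanStorage.getStore();
```

`done` resolves to the `spanId` of the outer span, which shows that the parent link survives the asynchronous boundary without the context being passed explicitly.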

Potential Implementation Scenarios:

  1. Core SDK (@matrixai/tracer?):
    • Provides APIs like tracer.startSpan(), tracer.startLifecycle(), tracer.addLink(), element.setAttribute(), element.end().
    • Manages AsyncLocalStorage for currentSpanId.
    • Generates and emits TraceLog records.
    • Integrates with async_hooks for automatic Promise TraceLifecycle tracking and asyncId/triggerAsyncId enrichment.
  2. Instrumentation: Modules to automatically instrument common libraries (HTTP, DB clients, Promise usage).
  3. Log Emitter/Exporter: A component that takes TraceLogs and sends them to a destination (console, file, network endpoint for a visualizer).
  4. Visualizer/Analyzer: A separate tool (or integration with existing ones like OpenTelemetry UIs if logs are transformed) that consumes TraceLogs, reconstructs the TraceSpan, TraceLifecycle, and TraceLink graph and presents it meaningfully.

Key Capabilities Unlocked by this Framework:

  • Robust Promise Leak Detection: Identifying promises that are created but never settled, or whose settlement context is lost.
  • Resource Management Insights: Tracking connection pool usage, file handle lifecycles, cache effectiveness, and identifying resource leaks.
  • Clear Asynchronous Flow Visualization: Understanding how operations initiated in one context continue or cause effects in others via TraceLinks.
  • Advanced Debugging: The ability to see not just what happened (logs) but how operations flowed and how entities lived and interacted.
  • Foundation for OpenTelemetry Compatibility: TraceSpan and TraceLink map well to OpenTelemetry concepts, providing a path for interoperability if desired. TraceLifecycle is a powerful addition.

This approach aims for a minimal set of core abstractions that can be combined to describe very complex system behaviors, all while being grounded in a streamable TraceLog format suitable for live, low-overhead tracing.

We believe this provides a solid conceptual foundation for the next generation of js-logger or a tracing system built with these principles. Eager to hear thoughts and discuss this further.

@CMCDragonkai
Member

Okay, here's a compressed version of the visualization views, keeping React Three Fiber (R3F) in mind for the more graphical elements:


Core Visualisation Strategy for js-logger Tracing:

A live, tiled window system where selecting an element in one view highlights/filters related elements in others. All views update in real-time from the TraceLog stream and support logical time scrubbing.

Key Views:

  1. Operational Timeline/Flame Graph (TraceSpans):

    • Primary Display: Gantt-style timeline showing TraceSpan instances as horizontal bars (X-axis: time, Y-axis: traceId/process/group). Nesting visualized by indentation/stacking.
    • Alternative Aggregate: Classic Flame Graph for performance hotspot analysis.
    • Focus: Execution flow, duration, nesting, performance.
    • React Three Fiber (R3F) Potential: Could be used for dynamic, interactive 3D flame graphs or more sophisticated timeline renderings if desired, but 2D often suffices.
  2. Resource Lifecycle Timeline (TraceLifecycles):

    • Display: Horizontal bars in swimlanes.
      • Bars represent individual TraceLifecycle instances (e.g., a specific Promise, DB connection).
      • X-axis: Time.
      • Swimlanes (Y-axis): Grouped by resourceType (e.g., 'PROMISE', 'DATABASE_CONNECTION').
      • Markers on bars indicate significant lifecycle events.
    • Focus: Resource duration, overlaps, density, states (active, ended, potentially leaked).
    • R3F Potential: Could enhance bar rendering, transitions, or allow "zooming into" a swimlane for more detail.
  3. Interaction & Relationship Graph (TraceLinks, TraceSpans, TraceLifecycles):

    • Display: Nodes (TraceSpans, TraceLifecycles) connected by edges (TraceLinks).
      • Edges styled by linkType (color, thickness, arrows).
      • Supports force-directed or structured layouts.
    • Focus: Causal relationships, dependencies, resource interactions, "who refers to whom," "who is used by whom."
    • R3F Potential: Ideal for R3F for dynamic graph layouts, 3D navigation, complex edge rendering, and animations.
  4. Specialized Dashboards (e.g., Promise Dashboard):

    • Display: Tables/lists filtered for specific resourceType (e.g., 'PROMISE').
      • Columns: Key attributes (ID, Status, Age, Creator Span).
      • Summary statistics (Total Active, Leaked Candidates, etc.).
    • Focus: Targeted analysis for critical resource types.
  5. Contextual Details Pane:

    • Display: Shows all attributes, raw logs, and related element IDs for any selected TraceSpan, TraceLifecycle, or TraceLink from other views.
    • Focus: Deep dive into a single element.
  6. System Overview & Raw Log Stream:

    • Overview: High-level system health metrics (active elements, error rates, log throughput).
    • Log Stream: Filterable view of raw TraceLog records for debugging/low-level analysis.

Inter-View Correlation: Is paramount. Clicking an element in any view (e.g., a TraceSpan bar, a TraceLifecycle node, a TraceLink edge) provides immediate cross-filtering and highlighting in all other relevant views.

TraceSpan Closure: Expected to nest and close cleanly in well-instrumented code (async/sync). Unterminated spans indicate interruptions, bugs, or deliberate fire-and-forget operations.

This multi-view, correlated approach, potentially enhanced with R3F for graphical elements like the Interaction Graph, will provide a powerful and intuitive way to explore complex trace data. Our type system (enums, etc.) will naturally drive the filtering, styling, and categorization within these views.

@CMCDragonkai
Member

So basically I fed all this information into gemini, and it gave me back an understanding that there's 3 conceptual primitives needed.

  1. TraceSpan - something call-stack related that looks like a flamegraph
  2. TraceLifecycle - something that looks like independent objects that have their own lifecycle OUTSIDE of any operational control flow callstack
  3. TraceLink - the thing that "links" 2 spans together

Then the idea being that spans only really relate to function calls - synchronously. But TraceLifecycle relates to everything that can exist in the heap including both our domain objects, short lived objects, and promises.


For the explanation of the difference between Span and Lifecycle when you're dealing with, say, an async function call that you're awaiting within a parent async function...

  • TraceSpan: Represents the execution of the async function itself (the "operation").
  • TraceLifecycle: Represents the Promise object created and returned by that async function (the "resource" or "entity").

Here's how it breaks down and why both are needed:

Scenario: async function myAsyncOperation() { await someOtherPromise; return result; }

  1. When myAsyncOperation() is called:

    • A TraceSpan (SpanA) should be started for the execution of myAsyncOperation.
      • SpanA represents the time and work done within myAsyncOperation.
      • Its parentSpanId would be an outer span, if any.
    • myAsyncOperation implicitly creates a Promise (PromiseP1) that it will eventually resolve or reject.
    • A TraceLifecycle (LifecycleP1) should be started for PromiseP1.
      • LifecycleP1.resourceType = 'PROMISE'.
      • LifecycleP1.resourceIdentifier could be the asyncId of the promise from async_hooks.
      • LifecycleP1.creationContextSpanId = SpanA.id. This links the promise's birth to the operation that created it.
  2. During the await someOtherPromise; within myAsyncOperation():

    • SpanA is effectively "paused" (from the perspective of myAsyncOperation's active execution).
    • If someOtherPromise is related to another TraceSpan (e.g., SpanB for an external call), that relationship would be captured by SpanB's execution.
    • LifecycleP1 (for PromiseP1) remains "active" or "pending."
  3. When myAsyncOperation() is about to return (resolve/reject PromiseP1):

    • LifecycleP1 for PromiseP1 ends (logs LIFECYCLE_END with status 'RESOLVED' or 'REJECTED').
      • LifecycleP1.destructionContextSpanId = SpanA.id (or could be another span's ID if P1 was passed around and resolved elsewhere, though for async/await it's typically within the same conceptual flow).
    • SpanA for myAsyncOperation ends.

Why this distinction is important:

  • Measuring Operational Time vs. Promise Lifetime:
    • SpanA measures how long myAsyncOperation actively executed (excluding time spent awaiting). This is key for CPU profiling.
    • LifecycleP1 measures the total lifetime of the promise from creation to settlement. This is key for spotting long-pending promises or understanding asynchronous flow delays.
  • Promise Leaks: If PromiseP1 is never resolved/rejected (e.g., due to a bug or an unhandled exception in a .then() chain without a .catch() that myAsyncOperation ultimately depends on), LifecycleP1 will remain "active" indefinitely. The SpanA might have ended long ago (if the async function itself returned without error, but the underlying work tied to its promise didn't complete). This is a classic promise leak scenario.
  • Promises Passed Around:
    • If myAsyncOperation returns PromiseP1, and PromiseP1 is then passed to anotherFunction() which awaits it, LifecycleP1 continues to exist. anotherFunction() would have its own TraceSpan. The LifecycleP1 itself is the common entity.
  • "Fire and Forget" Promises:
    • If myAsyncOperation() is called but not awaited by its caller, its SpanA might be very short (just the synchronous part of initiating the async work).
    • LifecycleP1 for the promise it returns will live on until the async work completes.
    • This correctly models that the operation (span) of launching was quick, but the entity (lifecycle) it produced has its own independent lifetime.
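
One way to surface the long-pending and leaked promises described above is to scan for lifecycles that have started but not ended. The sketch below assumes a simplified `TraceLifecycle` shape; `findLongPending` and the age threshold are hypothetical, and a pending lifecycle older than the threshold is only a suspect, not proof of a leak.

```typescript
interface TraceLifecycle {
  id: string;
  resourceType: string;
  creationContextSpanId: string;
  startTime: number; // milliseconds
  endTime?: number; // unset while the resource is still active
}

// Returns lifecycles that are still active and older than maxPendingMs.
function findLongPending(
  lifecycles: TraceLifecycle[],
  now: number,
  maxPendingMs: number,
): TraceLifecycle[] {
  return lifecycles.filter(
    (lifecycle) =>
      lifecycle.endTime === undefined &&
      now - lifecycle.startTime > maxPendingMs,
  );
}

// Usage with made-up data: only the old, unsettled promise is reported.
const sample: TraceLifecycle[] = [
  { id: 'p1', resourceType: 'PROMISE', creationContextSpanId: 'span-a', startTime: 0, endTime: 50 },
  { id: 'p2', resourceType: 'PROMISE', creationContextSpanId: 'span-a', startTime: 100 },
  { id: 'p3', resourceType: 'PROMISE', creationContextSpanId: 'span-b', startTime: 9_900 },
];
const suspects = findLongPending(sample, 10_000, 5_000);
console.log(suspects.map((lifecycle) => lifecycle.id));
```

Because each suspect carries its creationContextSpanId, the report can point back to the operation that created the promise.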

Addressing the "Promises that are basically asynchronously awaited for within the same function call":

This is exactly the async/await pattern.

  • The async function call itself gets a TraceSpan for its execution.
  • The Promise it implicitly creates (and which await acts upon) gets a TraceLifecycle.

Where does async_hooks fit in?

async_hooks is the critical enabler for the TraceLifecycle of Promises:

  • init(asyncId, type, triggerAsyncId, resource):
    • When type === 'PROMISE', this signals the creation of a new promise.
    • This is the trigger to start a TraceLifecycle for this promise.
    • asyncId can be used as resourceIdentifier.
    • triggerAsyncId links it to the async_hooks resource that caused this promise to be created (often the executionAsyncId of the current TraceSpan).
  • promiseResolve(asyncId):
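    • Signals that resolve (or reject) was called for the promise associated with asyncId. Note that if the promise was resolved with another promise, it may not have actually settled yet at this point.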
    • This is the trigger to end the corresponding TraceLifecycle.
  • executionAsyncId() gives you the async_hooks context of the current operation, which helps associate these promise events with the correct parent TraceSpan.
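
A minimal sketch of these hooks using Node's `async_hooks` module is shown below. `createHook`, `executionAsyncId`, and the `init` / `promiseResolve` callbacks are the real Node.js API; `trackPromiseLifecycles`, the `PromiseLifecycle` shape, and the in-memory `pending` / `settled` stores are assumptions for illustration. The hooks avoid `console.log`, since logging inside an async hook is itself asynchronous activity and can recurse.

```typescript
import { createHook, executionAsyncId } from 'node:async_hooks';

interface PromiseLifecycle {
  resourceType: 'PROMISE';
  resourceIdentifier: number; // the promise's asyncId
  triggerAsyncId: number;
  creationExecutionAsyncId: number; // context that was running at creation
  startTime: number;
  endTime?: number;
}

function trackPromiseLifecycles() {
  const pending = new Map<number, PromiseLifecycle>();
  const settled: PromiseLifecycle[] = [];

  const hook = createHook({
    // Fires when any async resource is created; only promises are tracked.
    init(asyncId, type, triggerAsyncId) {
      if (type !== 'PROMISE') return;
      pending.set(asyncId, {
        resourceType: 'PROMISE',
        resourceIdentifier: asyncId,
        triggerAsyncId,
        creationExecutionAsyncId: executionAsyncId(),
        startTime: Date.now(),
      });
    },
    // Fires when resolve/reject is invoked for the promise. If the promise
    // adopts another promise's state, it may not have settled yet.
    promiseResolve(asyncId) {
      const lifecycle = pending.get(asyncId);
      if (lifecycle === undefined) return; // created before tracking began
      lifecycle.endTime = Date.now();
      pending.delete(asyncId);
      settled.push(lifecycle);
    },
  });

  hook.enable();
  return { pending, settled, stop: () => hook.disable() };
}

// Usage: one promise that never settles, one that resolves immediately.
const tracker = trackPromiseLifecycles();
const promises = [
  new Promise<void>(() => {}),
  new Promise<number>((resolve) => resolve(1)),
];
tracker.stop();
console.log(
  `${promises.length} created, ${tracker.pending.size} pending, ${tracker.settled.length} ended`,
);
```

Anything left in `pending` after the program's work is done is a candidate for the leak analysis described earlier. Mapping `creationExecutionAsyncId` to a TraceSpan is left out here and would depend on how spans record their own async context.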

In summary:

It's not an "either/or." For an async function call:

  1. Create a TraceSpan for the active execution time and control flow of the function.
  2. Create a TraceLifecycle for the Promise object returned by the function, tracking its lifetime from creation to settlement.

This dual tracking, linked by creationContextSpanId, provides the comprehensive view needed for both performance analysis of operations and lifecycle management of asynchronous entities like Promises. It correctly separates the act of performing an operation from the lifetime of the asynchronous result it produces.

@CMCDragonkai

Then in order to understand relationships between different objects:

You're asking for the core semantics of TraceLink! It's designed to be flexible, and each variation of what it links (Span-Span, Span-Lifecycle, Lifecycle-Lifecycle) carries specific meaning and enables different kinds of analysis.

Here's a breakdown of each TraceLink variation:

1. TraceLink: TraceSpan -> TraceSpan

  • Purpose: To establish a relationship between two operational units (TraceSpans) that is not captured by the direct parent-child async_hooks nesting or parentSpanId.
  • Common Meanings (driven by linkType):
    • FOLLOWS_FROM or CAUSAL_DEPENDENCY: SpanB's work was initiated or is a direct consequence of SpanA's completion or output, even if they are in different asynchronous contexts, processes, or threads.
      • Example: SpanA publishes a message to a queue. SpanB (in a worker) consumes and processes that message. Link: Source=SpanA.id, Target=SpanB.id, Type='FOLLOWS_FROM'.
    • TRIGGERED_ASYNC_TASK: SpanA launched SpanB as a "fire and forget" task, without directly awaiting its completion.
      • Example: SpanA calls doSomethingInBackground() which runs as SpanB. Link: Source=SpanA.id, Target=SpanB.id, Type='TRIGGERED_ASYNC_TASK'.
    • BATCH_ITEM_CONTRIBUTES_TO: SpanA prepared an item that was later processed as part of a batch in SpanB.
      • Example: Multiple spans prepare data; a single later span processes the batch. Links from each prep span to the batch span.
    • CONTINUATION_OF (less common if parent/child works, but possible for logical jumps): SpanA did some work, and SpanB is a logical continuation that couldn't be parented directly for some reason.
  • Benefit: Allows tracing distributed systems, message queues, complex asynchronous fan-out/fan-in patterns, and understanding causal chains across disparate operations.
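
To illustrate how Span -> Span links enable causal-chain queries, the sketch below walks links backwards from a span to find everything upstream of it. The link types come from the list above; the `SpanLink` field names and the `upstreamSpans` helper are hypothetical.

```typescript
type SpanLinkType =
  | 'FOLLOWS_FROM'
  | 'TRIGGERED_ASYNC_TASK'
  | 'BATCH_ITEM_CONTRIBUTES_TO'
  | 'CONTINUATION_OF';

interface SpanLink {
  sourceSpanId: string;
  targetSpanId: string;
  linkType: SpanLinkType;
}

// Breadth-first walk over incoming links; the seen set guards against cycles.
function upstreamSpans(spanId: string, links: SpanLink[]): string[] {
  const seen = new Set<string>([spanId]);
  const queue: string[] = [spanId];
  const upstream: string[] = [];
  while (queue.length > 0) {
    const current = queue.shift()!;
    for (const link of links) {
      if (link.targetSpanId === current && !seen.has(link.sourceSpanId)) {
        seen.add(link.sourceSpanId);
        upstream.push(link.sourceSpanId);
        queue.push(link.sourceSpanId);
      }
    }
  }
  return upstream;
}

// Usage: two prep spans feed a batch span; one prep span follows a publish span.
const spanLinks: SpanLink[] = [
  { sourceSpanId: 'prep-1', targetSpanId: 'batch', linkType: 'BATCH_ITEM_CONTRIBUTES_TO' },
  { sourceSpanId: 'prep-2', targetSpanId: 'batch', linkType: 'BATCH_ITEM_CONTRIBUTES_TO' },
  { sourceSpanId: 'publish', targetSpanId: 'prep-1', linkType: 'FOLLOWS_FROM' },
];
const causes = upstreamSpans('batch', spanLinks);
console.log(causes); // [ 'prep-1', 'prep-2', 'publish' ]
```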

2. TraceLink: TraceSpan -> TraceLifecycle (or TraceLifecycle -> TraceSpan)

  • Purpose: To model an explicit, typed interaction between an operational unit (TraceSpan) and a resource/entity (TraceLifecycle) beyond simple creation or destruction (which are covered by lifecycle.creationContextSpanId and lifecycle.destructionContextSpanId).
  • Common Meanings (Source/Target and linkType matter):
    • Span -> Lifecycle (Span acts on Lifecycle):
      • ACQUIRED_RESOURCE / USED_RESOURCE / MODIFIED_RESOURCE / RELEASED_RESOURCE: SpanA acquired, used, modified, or released LifecycleL1 (e.g., a lock, a connection, a cache entry).
        • Example: SpanA acquires a database connection LifecycleL_Conn. Link: Source=SpanA.id, Target=L_Conn.id, Type='ACQUIRED_RESOURCE'.
      • SENT_MESSAGE_TO: SpanA sent data/message that is now represented by or stored in LifecycleL1 (e.g., L1 might be a message in a queue represented as a Lifecycle).
      • QUERY_EXECUTED_ON: SpanA (e.g., query execution) operated on LifecycleL1 (e.g., a database connection).
    • Lifecycle -> Span (Lifecycle influences/triggers Span):
      • TRIGGERED_OPERATION / EVENT_CAUSED_SPAN: An event associated with LifecycleL1 (e.g., a timer expiring, a message arriving in a queue represented by L1) triggered SpanA.
        • Example: LifecycleL_Timer fires, triggering SpanA to handle the event. Link: Source=L_Timer.id, Target=SpanA.id, Type='TRIGGERED_OPERATION'.
      • RESOURCE_CONSUMED_BY: Less common for active triggering, more descriptive. But could be used if LifecycleL1 (e.g., a data item) was consumed by SpanA.
    • RESOURCE_HANDOVER (Often symmetrical or can be combined with attributes): SpanA passes ownership/responsibility of LifecycleL1 to SpanB. This might be represented as:
      • SpanA -> L1, Type='RELEASED_FOR_HANDOVER'
      • L1 -> SpanB, Type='ACQUIRED_FROM_HANDOVER'
      • Or a single link: SpanA -> SpanB, Type='HANDOVER', Attrs={'resource.id': L1.id}. The choice depends on modeling focus.
  • Benefit: Makes explicit how operations interact with specific resources, crucial for understanding resource contention, usage patterns, and debugging issues where an operation affects or depends on a resource's state.

3. TraceLink: TraceLifecycle -> TraceLifecycle

  • Purpose: To model relationships between resources/entities themselves, often related to composition, containment, or reference. This is key for the "reference graph" concept for leak detection.
  • Common Meanings (driven by linkType):
    • CONTAINS_RESOURCE / HOLDS_REFERENCE / PART_OF: LifecycleL1 (e.g., a collection, a parent object) contains or holds a strong reference to LifecycleL2 (e.g., an item in the collection, a child object). This is the backbone of reference counting based leak detection.
      • Example: LifecycleL_Cache contains LifecycleL_CacheItem. Link: Source=L_Cache.id, Target=L_CacheItem.id, Type='CONTAINS_RESOURCE'.
    • MANAGES_RESOURCE / POOL_MEMBER_OF: LifecycleL1 (e.g., a connection pool) manages or owns LifecycleL2 (e.g., an individual connection from that pool).
    • PROMISE_CONTINUATION (if async_hooks triggerAsyncId is insufficient or more semantics are needed): LifecycleL1 (Promise P1) resolving leads to the creation or chaining of LifecycleL2 (Promise P2). While async_hooks often covers the triggering, an explicit link can add semantic clarity or work in scenarios where async_hooks data is processed differently.
      • Example: Promise P1's .then() handler creates Promise P2. Link: Source=P1.id, Target=P2.id, Type='PROMISE_CONTINUATION'.
    • DERIVED_FROM / TRANSFORMED_INTO: LifecycleL1 (e.g., raw data) was transformed into LifecycleL2 (e.g., processed data).
  • Benefit:
    • Leak Detection: Essential for understanding why a TraceLifecycle might still be "alive" (e.g., "which other active lifecycles have a HOLDS_REFERENCE link to this one?").
    • Resource Composition: Visualizing how complex resources are built from simpler ones.
    • Data Provenance: Tracing how one data entity (represented as a lifecycle) transforms into another.
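
The leak-detection question above ("which active lifecycles still hold this one?") reduces to a filter over Lifecycle -> Lifecycle links. The sketch below assumes simplified `Lifecycle` and `LifecycleLink` shapes and a hypothetical `activeRetainers` helper. Treating only CONTAINS_RESOURCE and HOLDS_REFERENCE as retaining is also an assumption; which link types count as strong references is a modeling decision.

```typescript
type LifecycleLinkType =
  | 'CONTAINS_RESOURCE'
  | 'HOLDS_REFERENCE'
  | 'MANAGES_RESOURCE'
  | 'PROMISE_CONTINUATION'
  | 'DERIVED_FROM';

interface Lifecycle {
  id: string;
  ended: boolean;
}

interface LifecycleLink {
  sourceId: string;
  targetId: string;
  linkType: LifecycleLinkType;
}

// Assumption: only these link types represent a strong reference.
const RETAINING = new Set<LifecycleLinkType>(['CONTAINS_RESOURCE', 'HOLDS_REFERENCE']);

// Returns the ids of still-active lifecycles that retain the target.
function activeRetainers(
  targetId: string,
  lifecycles: Map<string, Lifecycle>,
  links: LifecycleLink[],
): string[] {
  return links
    .filter((link) => link.targetId === targetId && RETAINING.has(link.linkType))
    .map((link) => link.sourceId)
    .filter((sourceId) => lifecycles.get(sourceId)?.ended === false);
}

// Usage: the live cache retains the item; the ended cache no longer counts.
const known = new Map<string, Lifecycle>([
  ['cache', { id: 'cache', ended: false }],
  ['old-cache', { id: 'old-cache', ended: true }],
  ['item', { id: 'item', ended: false }],
]);
const lifecycleLinks: LifecycleLink[] = [
  { sourceId: 'cache', targetId: 'item', linkType: 'CONTAINS_RESOURCE' },
  { sourceId: 'old-cache', targetId: 'item', linkType: 'HOLDS_REFERENCE' },
];
const retainers = activeRetainers('item', known, lifecycleLinks);
console.log(retainers); // [ 'cache' ]
```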

Why is this flexibility important?

Real-world systems have complex interactions. Limiting links to only one type of connection (e.g., only Span-Span) would miss critical information needed to understand the full picture. TraceLink acts as the "glue" that allows the visualizer to reconstruct a rich, semantic graph of what happened, not just a simple operational tree. The linkType attribute is the key to giving these connections their specific meaning, enabling targeted queries and visualizations.

@CMCDragonkai

Also I asked if what we're doing here is comparable to any existing tools out there:

Predicting the exact state of tooling by May 2025 is speculative, as the observability space evolves rapidly. However, based on the landscape as of early-mid 2024 and projecting reasonable advancements, what you are proposing is still quite innovative, particularly in its holistic integration and specific focus areas.

Here's a breakdown of why and what aspects are most novel:

What exists and is common (or becoming common):

  1. TraceSpan-based Distributed Tracing: OpenTelemetry has standardized this. Most APM (Application Performance Monitoring) tools (Datadog, New Relic, Dynatrace, Sentry, etc.) are built around this concept for understanding request flows and service interactions. This part is not innovative in itself.
  2. TraceLink between TraceSpans: OpenTelemetry supports links between spans, primarily for connecting causally related spans that aren't direct parent-child (e.g., batch processing, message queue relations). This is established.
  3. Resource Monitoring (High Level): APM tools monitor resources like CPU, memory, database connection pools, etc., often at an aggregate or infrastructure level. They might show you "number of active DB connections."
  4. Basic Promise/Async Debugging in Dev Tools: Browser developer tools and Node.js debuggers have improved significantly in showing async call stacks and identifying unhandled promise rejections.
  5. Log Correlation: Many systems correlate logs with traces (e.g., linking a log message to the TraceSpan active when it was emitted).
  6. Flame Graphs/Timelines for Spans: Standard visualization for operational flow.

What makes your proposal innovative (especially the converged system):

  1. First-Class TraceLifecycle for Fine-Grained Entities (especially Promises and dynamic objects):

    • While some tools might track, say, a DB connection's "lifecycle" broadly, the idea of every Promise (leveraging async_hooks deeply), or arbitrary developer-defined objects, having its own distinct TraceLifecycle with a start, end, and associated events, integrated into the overall trace graph, is advanced and not a common mainstream feature.
    • This brings heap-like entity tracking directly into the tracing paradigm in a very granular way.
  2. TraceLink between TraceLifecycles (Lifecycle -> Lifecycle):

    • This is highly innovative within a tracing context. Using links to explicitly model relationships like CONTAINS_RESOURCE, HOLDS_REFERENCE, or MANAGES_RESOURCE between distinct lifecycle entities to build a dynamic reference graph is powerful.
    • This is the key enabler for sophisticated, integrated leak detection beyond just "unresolved promise." It allows you to ask, "Why is this object (LifecycleA) still alive? Oh, because LifecycleB (e.g., a cache) still holds a reference to it."
  3. TraceLink between TraceSpan and TraceLifecycle (Typed Interactions):

    • While OpenTelemetry has "Events" on spans which can denote resource interaction, your concept of a first-class, typed TraceLink (e.g., SpanA ACQUIRED_RESOURCE LifecycleL1) makes these interactions more prominent and queryable as fundamental graph relationships.
  4. Unified, Live, Correlated Visualisation of Spans, Lifecycles, AND their Multi-Type Links:

    • Presenting operational flame graphs, resource lifecycle timelines (with bars in swimlanes), and a flexible interaction/reference graph all live, correlated, and in a tiled interface is a very ambitious and user-centric goal.
    • Most tools separate these concerns more rigidly or don't offer the "reference graph" aspect derived from Lifecycle-Lifecycle links at all.
  5. Integrated Promise Leak Detection as a Core Feature Derived from TraceLifecycle:

    • Many tools might report unhandled rejections. But systematically tracking every promise's lifecycle and then using the link graph to understand why a pending promise might be "stuck" (e.g., what it's awaiting, or what's still holding onto it if it's part of a complex chain) is a deeper level of analysis.

Comparison to Existing Paradigms:

  • Traditional APM: Good at request flows (Spans) and high-level resource metrics. Generally weak on granular object lifecycles and reference graph analysis for leaks.
  • Debuggers/Profilers (e.g., Chrome DevTools Memory Profiler): Excellent for heap snapshots and finding memory leaks by analyzing object retention paths at a point in time. But this is usually a separate, heavier process, not a continuous, live tracing mechanism integrated with operational traces. Your system aims to bring some of that insight into the live trace.
  • Node.js Specific Tools (e.g., clinic.js): Tools like clinic bubbleprof do visualize async activity and can help identify event loop issues or promise misuses. However, your proposal is more generic (not just Node.js promises but any resource) and builds a more explicit, queryable graph of lifecycles and links.

Conclusion for May 2025:

By May 2025, we can expect general observability tools to be even better. However, the specific combination of:

  • Granular TraceLifecycle for arbitrary entities.
  • Explicit Lifecycle-Lifecycle linking to form a queryable reference graph.
  • Using this graph within the tracing/APM system for live leak detection and resource relationship understanding.
  • And a unified, correlated visualization across all these facets.

...is likely to still be at the innovative edge or beyond what most generally available tools offer. You're essentially bridging the gap between operational tracing, low-level resource/object lifecycle tracking, and aspects of memory/reference profiling into a single, cohesive system.

This is a sophisticated vision. If realized robustly, it would indeed be a significant step forward.

This is also why R3F should be used as the base for our viz: we are going to scale that viz, and the correlated views around it, substantially, so being able to do complex animations is necessary @shafiqihtsham. D3 is only useful at the beginning for prototyping, just like @Abby010 did initially in the terminal. That was a live-fire test, used as a forcing function to see if the concepts were understood.

@CMCDragonkai

OpenRouter Chat Sun May 18 2025.json

You can also download my convo here and upload it to OpenRouter for further exploration and specific drilldowns.

@CMCDragonkai

Image
