Zeitgeist: A backend-neutral distributed inference protocol and runtime for heterogeneous local AI systems
Zeitgeist is an open protocol and modular runtime layer that allows different inference engines, hardware targets, and execution backends to participate in a shared distributed inference fabric.
Its purpose is to let systems built on:
- MLX
- vLLM
- llama.cpp
- TensorRT-LLM
- custom CUDA kernels
- custom Metal kernels
- custom local runtimes
interoperate through one common execution protocol.
Zeitgeist is not just a model-serving API. It is a full protocol for:
- capability discovery
- model identity
- backend negotiation
- execution planning
- tensor interchange
- KV-cache interchange
- distributed job orchestration
- failure handling
- transport framing
- modular backend and kernel extensibility
Zeitgeist is designed to be:
- open source
- protocol-first
- backend-neutral
- hardware-aware
- modular
- embeddable into our Mesh architecture
- usable by third parties independently of our product
Our project would integrate Zeitgeist natively as the shared execution layer, while still keeping our stronger control-plane, topology, and cooperative architecture.
Goals:
- Allow heterogeneous backends to participate in a shared inference network.
- Standardize distributed inference interoperability across local runtimes.
- Make model serving portable across Apple Silicon, Linux GPU, consumer desktops, workstations, and custom hardware.
- Allow backend authors to plug into the protocol without rewriting the whole mesh stack.
- Allow runtime-specific optimizations without breaking network compatibility.
- Support both solo serving and distributed serving under one common protocol family.
- Standardize operator and orchestration surfaces.
- Enable backend-aware routing and scheduling.
- Provide a stable foundation for open ecosystem adoption.
- Make backend and kernel innovation compatible with shared infrastructure.
Non-goals:
- Replacing every backend’s internal scheduler or executor.
- Forcing all runtimes into one internal implementation.
- Hiding all backend differences completely.
- Supporting every model architecture in the first protocol release.
- Guaranteeing zero-copy interchange across all runtimes in all cases.
Zeitgeist sits between application-facing APIs and backend-specific execution engines.
It is the interoperability plane.
Above Zeitgeist:
- OpenAI-compatible APIs
- app SDKs
- orchestration layers
- management consoles
- cluster/pool coordination
- billing, quota, and policy layers
Below Zeitgeist:
- MLX
- vLLM
- llama.cpp
- CUDA kernels
- Metal kernels
- CPU runtimes
- custom tensor engines
- custom quantization implementations
In our product, Zeitgeist becomes:
- the shared execution protocol
- the capability negotiation layer
- the backend and tensor interoperability contract
Our existing architecture remains stronger at:
- topology
- dispatch
- cooperative pooling
- recovery
- peer coordination
- data-plane design
Zeitgeist should have 4 major product surfaces:
Canonical open specification for wire format, capability model, execution semantics, and compatibility rules.
Reference implementation libraries for:
- Rust
- Python
- C/C++
- optional Swift for Apple ecosystem use
A modular runtime that implements the protocol and provides:
- transport
- capability registry
- execution planner
- backend adapters
- kernel registry
- debugging and observability tools
A compatibility suite that validates:
- backend conformance
- tensor correctness
- protocol compliance
- failure semantics
- distributed interoperability
User-facing inference APIs and SDKs.
Scheduling, job assignment, topology, policy, routing, and admission.
Backend-neutral interoperability plane.
Adapters for MLX, vLLM, llama.cpp, TensorRT-LLM, etc.
Backend-native kernels and custom pluggable kernel implementations.
CPU, Metal, CUDA, ROCm, Vulkan, etc.
A conformant Zeitgeist runtime should include:
- Capability Registry
- Model Registry
- Backend Manager
- Kernel Registry
- Execution Planner
- Session Manager
- KV Cache Manager
- Tensor Transport Manager
- Protocol Gateway
- Metrics/Tracing Layer
- Recovery Manager
- Compatibility Validator
Zeitgeist needs explicit protocol objects for all major runtime concepts.
A node advertises:
- node ID
- protocol version
- supported transports
- supported backends
- hardware profile
- memory profile
- trust/auth information
- runtime health
A backend descriptor includes:
- backend name
- backend version
- execution mode support
- supported model families
- supported quantization types
- supported dtypes
- supported attention variants
- supported cache formats
- supported tensor formats
- supported parallelism modes
- streaming support
- batching support
- custom extension support
Example:
- `mlx`
- `vllm`
- `llama_cpp`
- `tensorrt_llm`
- `custom/<vendor>/<runtime>`
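As a rough illustration of how a backend descriptor could be represented and compared, here is a minimal Python sketch. The class name `BackendDescriptor`, the field names, the helper `shared_capabilities`, and all example values (including the `0.0.0` versions and the `canonical-v1` cache format) are assumptions made for this example; they are not part of the specification and do not describe real backend support.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackendDescriptor:
    # Hypothetical field names covering a subset of the descriptor fields listed above.
    name: str
    version: str
    execution_modes: frozenset = frozenset()
    quantization_types: frozenset = frozenset()
    dtypes: frozenset = frozenset()
    cache_formats: frozenset = frozenset()
    streaming: bool = False
    batching: bool = False


# Set-valued fields that can be intersected between two backends.
SET_FIELDS = ("execution_modes", "quantization_types", "dtypes", "cache_formats")


def shared_capabilities(a: BackendDescriptor, b: BackendDescriptor) -> dict:
    """Return, per set-valued field, what both backends advertise."""
    return {name: getattr(a, name) & getattr(b, name) for name in SET_FIELDS}


# Placeholder values for illustration only.
mlx = BackendDescriptor(
    name="mlx",
    version="0.0.0",
    execution_modes=frozenset({"solo", "routed"}),
    dtypes=frozenset({"f16", "bf16"}),
    cache_formats=frozenset({"canonical-v1"}),
    streaming=True,
)
vllm = BackendDescriptor(
    name="vllm",
    version="0.0.0",
    execution_modes=frozenset({"solo", "routed", "tensor_parallel"}),
    dtypes=frozenset({"f16", "f32"}),
    cache_formats=frozenset({"canonical-v1"}),
    streaming=True,
    batching=True,
)

common = shared_capabilities(mlx, vllm)
```

Intersecting advertised sets like this is one plausible input to the negotiation phase; an empty intersection for a field signals that conversion or a more restricted execution mode would be needed.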
Model identity is critical.
A model must be identified by more than a filename.
It needs:
- canonical model family
- architecture ID
- parameter count
- tokenizer ID/hash
- vocabulary hash
- positional encoding type
- rope/scaling params
- attention variant
- hidden size
- layer count
- expert config if MoE
- quantization schema
- tensor layout schema
- model artifact hash
- revision/build metadata
Without this, cross-backend interoperability becomes chaos.
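One way to make such an identity comparable across nodes is to hash a canonical serialization of its fields. The sketch below assumes a JSON encoding with sorted keys and SHA-256; the function name `model_fingerprint`, the field names, and the example values are hypothetical and only illustrate the idea.

```python
import hashlib
import json


def model_fingerprint(identity: dict) -> str:
    """Hash a canonical (sorted-key, compact) JSON form of the identity fields."""
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical identity values; real descriptors would carry every field listed above.
identity_a = {
    "family": "example-family",
    "architecture_id": "example-arch",
    "layer_count": 32,
    "hidden_size": 4096,
    "tokenizer_hash": "aaaa",
    "quantization_schema": "q4-group64",
    "artifact_hash": "bbbb",
}
# Same fields in a different insertion order: must produce the same fingerprint.
identity_b = dict(reversed(list(identity_a.items())))
# A different tokenizer: must produce a different fingerprint.
identity_c = {**identity_a, "tokenizer_hash": "cccc"}
```

With this approach, two nodes that load the same file name but different tokenizers or quantization schemas would end up with different fingerprints, which is the failure the identity descriptor is meant to expose.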
Kernels must also be visible as protocol-level capabilities.
A kernel descriptor should include:
- kernel name
- implementation target
- operation type
- supported dtypes
- supported tensor layouts
- supported hardware
- precision characteristics
- determinism characteristics
- memory requirements
- optional vendor extensions
This allows custom kernels to plug in cleanly.
Zeitgeist must define capability negotiation as a first-class protocol phase.
An MLX node and a vLLM node cannot safely cooperate unless they agree on:
- model identity
- tensor schema
- cache schema
- distributed execution mode
- transport framing
- precision compatibility
Agree on protocol version, transport, auth mode, compression, framing.
Exchange backend, hardware, tensor, cache, and execution capabilities.
Exchange supported models and compatible model identities.
Determine whether nodes can participate in the same job and under what mode.
Commit to execution mode, partition plan, tensor schema, and cache schema.
Possible outcomes:
- fully compatible
- compatible with conversion
- compatible as API-only peer
- compatible only for solo serving
- incompatible
This distinction is essential.
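The five outcomes can be modeled as an explicit enum returned by the negotiation step. The following sketch is only one possible decision order; the dictionary keys (`protocol_version`, `model_fingerprint`, `schemas`, `importable`, `executes_kernels`), the helper `node`, and the rule that a model mismatch degrades to solo serving are assumptions for illustration.

```python
from enum import Enum


class Outcome(Enum):
    FULLY_COMPATIBLE = "fully compatible"
    COMPATIBLE_WITH_CONVERSION = "compatible with conversion"
    API_ONLY_PEER = "compatible as API-only peer"
    SOLO_ONLY = "compatible only for solo serving"
    INCOMPATIBLE = "incompatible"


def node(version, fingerprint, schemas, importable=(), executes_kernels=True):
    """Build a hypothetical capability summary for one node."""
    return {
        "protocol_version": version,
        "model_fingerprint": fingerprint,
        "schemas": set(schemas),        # tensor/cache schemas the node emits natively
        "importable": set(importable),  # schemas the node can convert from
        "executes_kernels": executes_kernels,
    }


def negotiate(local: dict, remote: dict) -> Outcome:
    # Assumed rule: protocol versions must match exactly.
    if local["protocol_version"] != remote["protocol_version"]:
        return Outcome.INCOMPATIBLE
    # A peer that never runs model kernels can only consume the mesh.
    if not remote["executes_kernels"]:
        return Outcome.API_ONLY_PEER
    # Assumed rule: different models cannot share a distributed job.
    if local["model_fingerprint"] != remote["model_fingerprint"]:
        return Outcome.SOLO_ONLY
    if local["schemas"] & remote["schemas"]:
        return Outcome.FULLY_COMPATIBLE
    # Conversion must work in both directions for bidirectional exchange.
    if (local["schemas"] & remote["importable"]) and (remote["schemas"] & local["importable"]):
        return Outcome.COMPATIBLE_WITH_CONVERSION
    return Outcome.SOLO_ONLY


a = node("1", "m1", {"row-major-f16"}, importable={"blocked-q4"})
b = node("1", "m1", {"row-major-f16"})
c = node("1", "m1", {"blocked-q4"}, importable={"row-major-f16"})
d = node("1", "m1", set(), executes_kernels=False)
e = node("2", "m1", {"row-major-f16"})
f = node("1", "m2", {"row-major-f16"})
```

Returning a named outcome rather than a boolean lets an orchestrator route a peer into a weaker role instead of rejecting it outright.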
Zeitgeist should support multiple execution modes, even if not all are implemented at once.
- Solo: One node serves locally using its preferred backend.
- Routed serving: Different nodes serve requests independently; orchestration routes per request.
- Tensor-parallel: Nodes share layers/tensors in a synchronized distributed execution graph.
- Pipeline-parallel: Layers are partitioned across nodes.
- Expert-parallel: Experts are distributed across nodes with shared routing semantics.
- Hybrid: A combination of tensor, pipeline, and expert parallelism.
- Client-only: The node does not execute model kernels; it only consumes the mesh.
Tensor interchange is the heart of the protocol.
Zeitgeist must define a canonical tensor interchange contract.
Each tensor exchange should include:
- tensor ID
- op context ID
- model/job/session ID
- tensor role
- shape
- dtype
- layout
- quantization descriptor
- endian/framing
- compression flag
- checksum
- sequence number
- optional chunk metadata
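To make the envelope concrete, the sketch below frames a tensor as a fixed prefix, a JSON metadata header, and a raw payload, protected by a CRC32 checksum. The magic bytes `ZGT1`, the little-endian prefix layout, the JSON header encoding, and the choice of CRC32 are all assumptions for this example, not the actual Zeitgeist wire format.

```python
import json
import struct
import zlib

MAGIC = b"ZGT1"          # hypothetical 4-byte frame marker
PREFIX = struct.Struct("<III")  # header length, payload length, CRC32 (little-endian)


def encode_frame(header: dict, payload: bytes) -> bytes:
    """Serialize envelope metadata plus tensor bytes into one checksummed frame."""
    head = json.dumps(header, sort_keys=True).encode("utf-8")
    checksum = zlib.crc32(head + payload)
    return MAGIC + PREFIX.pack(len(head), len(payload), checksum) + head + payload


def decode_frame(frame: bytes):
    """Validate the frame and return (header, payload); raise ValueError on damage."""
    if frame[:4] != MAGIC:
        raise ValueError("bad magic")
    head_len, payload_len, checksum = PREFIX.unpack_from(frame, 4)
    body = frame[4 + PREFIX.size:]
    if len(body) != head_len + payload_len:
        raise ValueError("truncated or oversized frame")
    if zlib.crc32(body) != checksum:
        raise ValueError("checksum mismatch")
    header = json.loads(body[:head_len].decode("utf-8"))
    return header, body[head_len:]


# Hypothetical envelope values mirroring a few of the fields listed above.
header = {
    "tensor_id": "t-0",
    "session_id": "s-0",
    "role": "activation",
    "shape": [2, 2],
    "dtype": "f32",
    "layout": "row-major",
    "sequence": 0,
}
payload = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
frame = encode_frame(header, payload)
decoded_header, decoded_payload = decode_frame(frame)
```

A production format would likely use a binary header and a stronger checksum, but the shape of the check (length validation first, then integrity, then parsing) carries over.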
The protocol should define canonical layout families, for example:
- contiguous row-major
- contiguous column-major
- backend-specific blocked layouts
- quantized tile layouts
- sparse layouts
- extension layouts
Backends may use internal layouts, but protocol interchange needs canonical forms or declared conversion paths.
For dtype compatibility, the protocol must define:
- exact compatibility
- safe promotion
- safe demotion
- unsupported casts
- lossy conversion flags
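A small classifier shows how these categories might be expressed. The table below covers only a few floating-point types and treats every widening cast in it as safe and its reverse as a lossy demotion; the names `SAFE_PROMOTIONS` and `classify_cast` and the category strings are assumptions for illustration.

```python
# Widening casts that preserve every representable value of the source type.
SAFE_PROMOTIONS = {
    ("f16", "f32"),
    ("bf16", "f32"),
    ("f16", "f64"),
    ("bf16", "f64"),
    ("f32", "f64"),
}


def classify_cast(src: str, dst: str) -> str:
    """Classify a dtype conversion for this sketch's small type set."""
    if src == dst:
        return "exact"
    if (src, dst) in SAFE_PROMOTIONS:
        return "safe_promotion"
    if (dst, src) in SAFE_PROMOTIONS:
        # The reverse of a widening cast can lose precision or range.
        return "lossy_demotion"
    # Pairs not in the table (for example f16 <-> bf16) are rejected here.
    return "unsupported"
```

For example, `classify_cast("bf16", "f32")` is a safe promotion, while `classify_cast("f16", "bf16")` is reported as unsupported because neither type can represent all values of the other. Whether a lossy demotion is acceptable would be a policy decision made by the planner.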
Quantization must not be “just backend-specific magic.”
Zeitgeist needs a quantization schema model for:
- quant format name
- grouping/block size
- scale format
- zero-point format
- packing layout
- tensor-specific exceptions
- calibration metadata where relevant
KV-cache interchange is one of the hardest parts.
A true cross-backend distributed protocol requires an explicit cache contract.
A cache descriptor must include:
- cache format version
- key/value dtype
- layout
- head grouping
- rope state assumptions
- sequence indexing semantics
- eviction semantics
- compression options
- backend-native extensions
All participating backends use the same canonical cache format.
Backends support import/export to canonical Zeitgeist cache format.
Backend can participate only when cache mobility is not required.
This allows graceful compatibility instead of pretending all caches are portable.
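The three levels of cache compatibility described above could be selected per job roughly as follows. The mode names (`native`, `convertible`, `non_portable`), the per-backend flags, and the function `select_cache_mode` are hypothetical labels chosen for this sketch.

```python
def select_cache_mode(backends, cache_mobility_required):
    """Pick the strongest cache mode every backend supports, or None if the job cannot run.

    Each backend is a dict with two hypothetical flags:
      native_canonical: uses the canonical cache format internally
      import_export:    can convert to and from the canonical format
    """
    if all(b["native_canonical"] for b in backends):
        return "native"
    if all(b["native_canonical"] or b["import_export"] for b in backends):
        return "convertible"
    if not cache_mobility_required:
        # Caches stay where they were created; no cross-node transfer is attempted.
        return "non_portable"
    return None


canonical = {"native_canonical": True, "import_export": True}
converting = {"native_canonical": False, "import_export": True}
opaque = {"native_canonical": False, "import_export": False}
```

Returning `None` when mobility is required but unavailable keeps the failure explicit at planning time instead of surfacing later as a cache import error.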
Zeitgeist should define explicit job types:
- chat completion
- text completion
- embedding
- ranking
- token verification
- speculative decode coordination
- model warmup
- cache export/import
- tensor op execution
- distributed shard execution
A session includes:
- session ID
- model identity
- tokenizer identity
- backend context
- execution mode
- cache state
- routing affinity
- consistency mode
- trace context
A standard lifecycle should include:
- proposed
- admitted
- planned
- assigned
- acknowledged
- executing
- streaming
- completed
- failed
- cancelled
- recovered
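The lifecycle states above can be enforced with a small transition table. The source lists the states but not the allowed transitions, so the table below is an assumption: a linear happy path, failure or cancellation from any active state, and recovery leading back to planning.

```python
ACTIVE = ("proposed", "admitted", "planned", "assigned", "acknowledged", "executing", "streaming")

# Hypothetical transition table; only the state names come from the lifecycle list.
TRANSITIONS = {
    "proposed": {"admitted"},
    "admitted": {"planned"},
    "planned": {"assigned"},
    "assigned": {"acknowledged"},
    "acknowledged": {"executing"},
    "executing": {"streaming", "completed"},
    "streaming": {"completed"},
    "failed": {"recovered"},
    "recovered": {"planned"},
    "completed": set(),
    "cancelled": set(),
}
# Any active state may also fail or be cancelled.
for _state in ACTIVE:
    TRANSITIONS[_state] |= {"failed", "cancelled"}


def advance(state: str, nxt: str) -> str:
    """Return the next state, or raise ValueError if the transition is not allowed."""
    if nxt not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition: {state} -> {nxt}")
    return nxt


def run(path, start="proposed"):
    """Apply a sequence of transitions and return the final state."""
    state = start
    for nxt in path:
        state = advance(state, nxt)
    return state
```

For instance, `run(["admitted", "planned", "assigned", "acknowledged", "executing", "failed", "recovered", "planned"])` ends in `planned`, modeling a job that is replanned after a failure.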
Zeitgeist should be transport-agnostic but define transport requirements.
- QUIC
- TCP
- Unix domain sockets
- shared memory transport
- in-process transport
- backend-specific high-speed plugins
Transport must support:
- streaming
- framing
- multiplexing
- backpressure
- ordered and optionally unordered flows
- checksums
- cancellation
- retry semantics
- optional compression
At minimum, define logical channels for:
- control
- capability negotiation
- tensor transport
- cache transport
- job state
- metrics/events
- debug/trace
Because Zeitgeist is open-source and likely widely embedded, auth must be modular.
- none
- shared token
- mTLS
- signed node identity
- backend-signed attestation
- extension auth providers
Define trust categories such as:
- trusted executor
- trusted cache peer
- trusted tensor peer
- API-only peer
- untrusted external client
Not every peer should be allowed into the same execution role.
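A simple way to express this is a table mapping each execution role to the trust categories allowed to perform it. The role names and the specific assignments below are assumptions for illustration; only the trust category names come from the list above.

```python
# Hypothetical mapping from role to the trust categories permitted to take it.
ROLE_REQUIREMENTS = {
    "execute_shard": {"trusted executor"},
    "exchange_cache": {"trusted executor", "trusted cache peer"},
    "exchange_tensors": {"trusted executor", "trusted tensor peer"},
    "submit_request": {
        "trusted executor",
        "trusted cache peer",
        "trusted tensor peer",
        "API-only peer",
    },
}


def may_perform(trust_category: str, role: str) -> bool:
    """Return True only if the role is known and the category is allowed for it."""
    return trust_category in ROLE_REQUIREMENTS.get(role, set())
```

Unknown roles and unlisted categories are denied by default, so adding a new role requires an explicit decision about who may take it.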
A backend adapter interface is essential.
A backend must implement something like:
- capability reporting
- model loading
- tokenizer binding
- inference execution
- tensor export/import
- cache export/import
- session state hooks
- metrics hooks
- error mapping
- shutdown/recovery behavior
- local serving backend
- distributed execution backend
- cache-capable backend
- tensor-export backend
- planner-only backend
- client-only backend
A backend does not need to support all classes.
MLX:
- Apple Silicon optimized
- likely excellent solo/local serving backend
- maybe limited initially in distributed tensor role depending on cache/tensor support maturity
vLLM:
- Linux GPU serving
- high-throughput serving
- strong API backend
- likely excellent routed/solo backend
- distributed execution role depends on protocol adapter completeness
Vendors or researchers can plug in niche runtimes while still participating in the ecosystem.
Zeitgeist should support custom kernels as protocol-visible acceleration modules.
Users may want:
- custom attention kernels
- custom quant matmul kernels
- vendor-specific fused ops
- optimized MoE routing kernels
- experimental research kernels
Each kernel plugin should declare:
- op types
- supported tensor schemas
- supported dtypes
- target hardware
- determinism guarantees
- memory requirements
- compatibility with protocol layouts
- fallback path if unavailable
At execution planning time, Zeitgeist should choose:
- required kernel
- preferred kernel
- acceptable fallback kernel
- backend-native fallback
- protocol-level incompatibility if none available
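The resolution order above can be sketched as a single function. This example assumes that a required kernel is a hard constraint (if it is missing, planning fails rather than falling back), which is one interpretation of the list; the function name, parameters, and kernel names are hypothetical.

```python
def resolve_kernel(required, preferred, fallbacks, available, backend_native_ok):
    """Return (resolution, kernel_name) following the order described above.

    required:          kernel name that must be used, or None
    preferred:         kernel name to try first, or None
    fallbacks:         ordered list of acceptable alternative kernel names
    available:         set of kernel names registered on the node
    backend_native_ok: whether the backend's own implementation is acceptable
    """
    if required is not None:
        # Assumption: a required kernel cannot be substituted.
        if required in available:
            return ("required", required)
        return ("incompatible", None)
    if preferred is not None and preferred in available:
        return ("preferred", preferred)
    for name in fallbacks:
        if name in available:
            return ("fallback", name)
    if backend_native_ok:
        return ("backend_native", None)
    return ("incompatible", None)


# Hypothetical kernel names.
available = {"fused_attention_v2", "generic_attention"}
```

Returning the resolution kind alongside the kernel name gives the planner something to record as a fallback reason in its decision log.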
Zeitgeist needs a planning layer, even if orchestration is external.
- determine compatibility set
- choose execution mode
- assign partitions
- choose transport schemas
- choose cache schema
- choose backend roles
- validate determinism/precision policy
- define fallback paths
Planner must consider:
- backend support
- hardware capability
- transport costs
- topology
- latency
- memory
- quantization compatibility
- kernel availability
- cache portability
- trust level
A real protocol needs explicit failure semantics.
- transport failure
- backend crash
- kernel failure
- incompatible tensor schema
- cache import failure
- peer timeout
- partition invalidation
- model mismatch
- unsupported conversion
- resource exhaustion
- retry same peer
- retry alternate peer
- degrade to solo
- degrade to routed serving
- replan partition
- rehydrate cache
- abort session
- surface partial failure with reason
Recovery should be configurable by:
- strict correctness mode
- best-effort mode
- low-latency mode
- deterministic mode
- high-availability mode
Zeitgeist should have first-class operator semantics, not leave everything to logs.
- node capabilities
- backend inventory
- model inventory
- current sessions
- current jobs
- transport health
- tensor throughput
- cache mobility status
- planner decisions
- fallback reasons
- failure events
Must support a live event stream for:
- joins/leaves
- planner decisions
- backend changes
- model load/unload
- kernel resolution
- failure and recovery events
- transport degradation
- cache portability failures
A management console should be able to render:
- topology
- backend types per node
- model compatibility state
- current execution plan
- active jobs
- fallback/recovery status
Zeitgeist itself should expose both machine-oriented and human/operator-oriented APIs.
Strict protocol endpoints for peers and orchestrators.
JSON and SSE surfaces for operator tools and dashboards.
Allows tooling to ask:
- can these nodes cooperate?
- under what execution mode?
- what conversion penalties apply?
- what kernel or backend limitations block execution?
This is extremely valuable.
Since Zeitgeist will be open-sourced as a protocol, ecosystem design matters.
- open protocol spec
- reference Rust SDK
- conformance suite
- backend adapter examples
- kernel plugin examples
- compatibility test harness
- sample management UI schema
Third parties should be able to add:
- custom backends
- custom kernels
- custom transports
- auth extensions
- model-family extensions
- quantization extensions
Need explicit versioning for:
- protocol version
- model descriptor version
- tensor schema version
- cache schema version
- backend plugin API version
- kernel plugin API version
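A version check across these surfaces might look like the sketch below. It assumes exact matching for every versioned surface; the key names and the function `version_mismatches` are hypothetical.

```python
# Hypothetical keys, one per versioned surface listed above.
VERSIONED = (
    "protocol",
    "model_descriptor",
    "tensor_schema",
    "cache_schema",
    "backend_plugin_api",
    "kernel_plugin_api",
)


def version_mismatches(local: dict, remote: dict) -> list:
    """List every versioned surface where the two peers do not match exactly."""
    return [key for key in VERSIONED if local.get(key) != remote.get(key)]


local_versions = {key: "1" for key in VERSIONED}
remote_versions = {**local_versions, "cache_schema": "2"}
```

Reporting the list of mismatched surfaces, rather than a single pass/fail flag, lets negotiation distinguish a peer that is unusable from one that only loses cache mobility.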
To make Zeitgeist fully functional, we need all of the following categories.
- wire format
- versioning
- framing
- auth hooks
- transport abstractions
- node capabilities
- backend descriptors
- model identity descriptors
- kernel descriptors
- cache descriptors
- compatibility negotiation
- job lifecycle
- session lifecycle
- planning
- assignment
- tensor exchange
- cache exchange
- streaming output
- cancellation
- backend plugin API
- kernel plugin API
- transport plugin API
- extension registry
- health model
- failure semantics
- recovery semantics
- replayability
- observability
- SDKs
- docs
- conformance tests
- certification suite
- examples
The full spec is the long-term target, but a narrow V1 is still important.
A realistic narrow V1 should include:
- one protocol version
- one canonical tensor schema family
- one canonical cache schema
- one model family only
- MLX backend adapter
- one Linux backend adapter, likely vLLM or llama.cpp depending on practicality
- solo mode
- routed serving mode
- one mixed-backend distributed execution mode
- strict capability negotiation
- no silent compatibility assumptions
- strong conformance suite
A narrow V1 should explicitly exclude:
- all model families
- all quantization formats
- all distributed modes
- arbitrary cache mobility
- arbitrary backend/kernel extensions in the first release
That narrowness is what makes the full long-term vision reachable.
Trying to standardize too much at once.
Claiming mixed-backend compatibility without hard canonical contracts.
Each backend evolves independently and breaks assumptions.
Cross-backend distributed failures can be extremely hard to debug.
Canonical interchange can introduce conversion or transport overhead.
If Zeitgeist works, it becomes a serious moat.
Why:
- backend-neutral execution is strategically powerful
- the ecosystem value compounds
- it prevents lock-in to one runtime
- it fits the democratized AI thesis
- it gives us the strongest long-term technical narrative
This is not a small feature. This is a platform.
Build Zeitgeist.
But treat it as a first-class product and protocol initiative with:
- its own spec
- its own modular runtime
- its own conformance suite
- its own extension system
- its own versioning and compatibility story
And most importantly:
Do not define it as “backend adapters for our product.”
Define it as:
- an open inference interoperability protocol
- which our product uses natively
- and which others can adopt independently
That gives us the right architecture, the right open-source posture, and the right long-term leverage.
This checklist is the current source-of-truth status against this spec.
- ✅ Protocol-oriented Rust reference runtime exists.
- ✅ Canonical standalone normative protocol document exists in-repo.
- ⬜ Python SDK.
- ⬜ C/C++ SDK.
- ⬜ Swift SDK.
- ✅ Conformance-oriented test suite scaffold exists with deterministic and host-backed verification.
- ✅ Capability Registry.
- ✅ Model Registry.
- ✅ Backend Manager.
- ✅ Kernel Registry.
- ✅ Execution Planner.
- ✅ Session Manager.
- ⬜ KV Cache Manager with real cache import/export implementation.
- ⬜ Tensor Transport Manager with real binary interchange paths.
- ✅ Protocol Gateway.
- ✅ Metrics/Tracing layer via runtime events plus Fozzy artifacts/profile outputs.
- ✅ Recovery Manager with active failover execution.
- ✅ Recovery Manager with active replan execution.
- ⬜ Compatibility Validator as a dedicated external certification tool.
- ✅ Node identity object.
- ✅ Backend descriptor.
- ✅ Model identity descriptor.
- ✅ Kernel descriptor.
- ✅ Cache descriptor.
- ✅ Tensor envelope type.
- ✅ Job/session/result/event objects.
- ✅ Compatibility phases modeled through compatibility and planning APIs.
- ✅ Compatibility outcomes encoded explicitly.
- ✅ Exact-only protocol version policy enforced on negotiation and job submission endpoints.
- ✅ Authenticated handshake over a real TCP transport session.
- ✅ Solo execution planning.
- ✅ Routed serving planning.
- ✅ Pipeline-parallel planning surface.
- ✅ Tensor-parallel runtime execution.
- ✅ Expert-parallel runtime execution.
- ✅ Hybrid runtime execution.
- ✅ Client-only incompatibility outcome.
- ✅ Canonical tensor envelope schema defined in code.
- ✅ Canonical layout and dtype enums defined in code.
- ✅ Quantization descriptor schema defined in code.
- ✅ Canonical binary tensor framing with checksum validation.
- ✅ Tensor roundtrip verification API.
- ⬜ Binary tensor transport conversion across real heterogeneous backends.
- ⬜ Checksum-verified chunked tensor streaming.
- ✅ Cache descriptor schema defined in code.
- ✅ Transferability declared explicitly.
- ✅ Canonical cache serialization format.
- ✅ Cache roundtrip verification path.
- ⬜ Cache export/import execution paths across real backends.
- ⬜ Cache conversion engine across real backends.
- ✅ Job types defined.
- ✅ Job lifecycle states defined.
- ✅ Session summaries exposed in the management API.
- ✅ Deterministic synthetic job execution path.
- ✅ Streaming token output API.
- ✅ Cancellation API for in-flight jobs.
- ✅ Recovery flow.
- ✅ In-process and HTTP management transport surfaces exist.
- ✅ Transport health API.
- ✅ QUIC transport.
- ✅ TCP peer transport for handshake/capability/plan exchange.
- ✅ Unix domain socket transport.
- ⬜ Shared memory transport.
- ✅ Framed tensor/cache logical channels over peer transport.
- ✅ Auth mode and trust level types exist.
- ✅ Enforced shared-token auth.
- ✅ mTLS.
- ✅ Signed node identity.
- ✅ Backend attestation.
- ✅ Backend adapter trait.
- ✅ Live backends are the shipped default runtime path.
- ✅ Synthetic backend for deterministic certification.
- ✅ MLX-shaped adapter.
- ✅ vLLM OpenAI-compatible adapter with upstream served-model alias resolution.
- ✅ Kernel descriptor surface.
- ⬜ Dynamic plugin loading ABI.
- ⬜ Real llama.cpp adapter.
- ⬜ Real TensorRT-LLM adapter.
- ✅ Compatibility-aware planning.
- ✅ Fallback modes included in plans.
- ✅ Latency-aware deterministic policy input modeled.
- ✅ Topology-aware cost model.
- ✅ Memory-pressure-aware repartitioning.
- ✅ Trust-aware peer exclusion in live scheduling.
- ✅ Failure classes are modeled as explicit incompatible/error outcomes.
- ✅ Failed jobs are recorded with reasons.
- ✅ Operator cancellation path.
- ✅ Retry alternate peer.
- ✅ Degrade to solo/routed fallback automatically.
- ✅ Retry same peer.
- ⬜ Rehydrate cache.
- ✅ Partial failure surfacing for distributed execution.
- ✅ Node/backends/models/kernels/jobs/sessions APIs.
- ✅ Event log and SSE event stream.
- ✅ Fozzy artifact, memory, replay, CI, and profiling outputs.
- ✅ Transport health API.
- ✅ Planner decision audit API with persistent histories.
- ✅ Topology UI schema/API.
- ✅ Mesh peer registry and sync APIs.
- ✅ Protocol/management HTTP API exists.
- ✅ Compatibility API exists.
- ✅ Schema discovery endpoint exists.
- ✅ Tensor/cache roundtrip verification APIs exist.
- ✅ Peer-wire TCP endpoints for node-to-node negotiation/planning.
- ✅ Peer-wire TCP remote execution endpoint.
- ✅ Mesh management HTTP endpoints for peer registration and sync.
- ⬜ gRPC transport.
- ✅ Concise technical README centered on functionality.
- ✅ Open-source-friendly backend-neutral positioning in docs.
- ✅ Version policy explicitly states no backwards compatibility.
- ⬜ Separate conformance certification packaging.
- ⬜ Example third-party extension packages.
- ✅ `cargo test` passes.
- ✅ Quilt Linux `cargo test` pass.
- ✅ Fozzy deterministic doctor pass.
- ✅ Fozzy deterministic `test` pass.
- ✅ Fozzy deterministic `run` with recorded trace.
- ✅ Fozzy `trace verify` pass.
- ✅ Fozzy `replay` pass.
- ✅ Fozzy `ci` pass.
- ✅ Fozzy QUIC deterministic doctor/test/run/trace verify/replay/ci pass.
- ✅ Fozzy `explore` pass.
- ✅ Fozzy `fuzz` pass.
- ✅ Fozzy memory inspection commands exercised.
- ✅ Fozzy profile commands exercised.
- ✅ Fozzy corpus commands exercised.
- ✅ Host-backed live HTTP validation against a running runtime.
- ✅ Live QUIC peer handshake/capabilities/remote execution validation against a running runtime.
- ✅ Quilt Linux vLLM installation/import verification.
- ✅ Live MLX end-to-end inference through the real adapter.
- ✅ Live mesh remote execution through a second Zeitgeist node.
- ✅ Live mesh host-backed Fozzy doctor/test/run/trace verify/replay/ci path.
- ✅ Live vLLM end-to-end inference through a real remote server.
- ✅ Live default-config vLLM job execution through the Rust runtime.
- ✅ Live vLLM Fozzy deterministic doctor/test/run/trace verify/replay/ci path.