Commit 8491551

--wip--
1 parent 8516d23 commit 8491551

1 file changed

+81
-23
lines changed

decouple-nats.md

Lines changed: 81 additions & 23 deletions
@@ -2,28 +2,48 @@
22

33
## Overview
44

5-
The goal is to decouple the NATs transport from the pipeline.
5+
The goal is to decouple the NATS transport from the Dynamo runtime.
6+
Introduce abstractions for the current NATS usages (e.g. KV router, event plane, request plane, and object store) so that different implementations can be plugged in.
67

78
## Requirements
8-
99
- Ability to deliver messages across Dynamo instances with an at-least-once delivery guarantee.
10+
Ensure serialized bytes can be reliably delivered in p2p and broadcast modes.
11+
1012
- Ability to switch between transports at runtime.
1113

14+
### Event Plane
15+
- KV events need to be delivered with an at-least-once delivery guarantee. Currently, KV events are handled idempotently, so redelivered messages are ignored.
16+
17+
#### Metrics
18+
19+
### Object Store
20+
21+
### Request Plane
22+
- Needs a protocol for request context and for control messages such as request cancellation.
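
One way to picture the request context and cancellation control message is the minimal sketch below. The names `RequestContext`, `ControlMessage`, `ControlKind`, and `RequestRegistry` are hypothetical and do not come from the Dynamo codebase. The sketch only shows how a control message could be matched to an in-flight request by id.

```python
import threading
from dataclasses import dataclass, field
from enum import Enum


class ControlKind(Enum):
    CANCEL = "cancel"  # hypothetical control message type


@dataclass
class RequestContext:
    """Per-request state that handlers can poll for cancellation."""
    request_id: str
    _cancelled: threading.Event = field(default_factory=threading.Event)

    def cancel(self) -> None:
        self._cancelled.set()

    @property
    def is_cancelled(self) -> bool:
        return self._cancelled.is_set()


@dataclass(frozen=True)
class ControlMessage:
    kind: ControlKind
    request_id: str


class RequestRegistry:
    """Tracks in-flight requests so control messages can find them."""

    def __init__(self) -> None:
        self._active = {}

    def start(self, request_id: str) -> RequestContext:
        ctx = RequestContext(request_id)
        self._active[request_id] = ctx
        return ctx

    def on_control(self, msg: ControlMessage) -> bool:
        """Apply a control message; return False if the request is unknown."""
        ctx = self._active.get(msg.request_id)
        if ctx is None:
            return False
        if msg.kind is ControlKind.CANCEL:
            ctx.cancel()
        return True


registry = RequestRegistry()
ctx = registry.start("req-1")
handled = registry.on_control(ControlMessage(ControlKind.CANCEL, "req-1"))
unknown = registry.on_control(ControlMessage(ControlKind.CANCEL, "req-404"))
```

Keeping the control message independent of the transport means the same cancellation path could be reused whether the request arrived over NATS, ZMQ, or raw TCP.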
1223

13-
## Durability guarantees
24+
### Message Delivery Guarantees
1425

15-
### At least once delivery
26+
#### At least once delivery (preferred)
1627
- Message loss is not possible.
1728
- Each message is delivered at least once to the consumers.
29+
- Consumers should be idempotent so that they can handle duplicate messages.
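
A minimal sketch of an idempotent consumer under at-least-once delivery follows. It assumes every message carries a unique id assigned by the producer. The `Message` and `IdempotentConsumer` names and the `msg_id` field are illustrative and not part of the existing code.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    msg_id: str     # hypothetical unique id assigned by the producer
    payload: bytes


@dataclass
class IdempotentConsumer:
    """Applies each message once, even if the transport redelivers it."""
    seen: set = field(default_factory=set)
    applied: list = field(default_factory=list)

    def handle(self, msg: Message) -> bool:
        """Return True if the message was applied, False if it was a duplicate."""
        if msg.msg_id in self.seen:
            return False  # redelivered message: ignore it
        self.seen.add(msg.msg_id)
        self.applied.append(msg.payload)
        return True


consumer = IdempotentConsumer()
# The transport redelivers message "a" after the first two deliveries.
deliveries = [Message("a", b"1"), Message("b", b"2"), Message("a", b"1")]
results = [consumer.handle(m) for m in deliveries]
```

In a long-running consumer the `seen` set would need a bound (for example a sliding window or per-producer sequence numbers); the unbounded set here is only for illustration.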
1830

19-
### Exactly once delivery
20-
- needs ack/nack coordination and stateful tracking of messages to ensure once delivery.
31+
#### Exactly once delivery
32+
- Needs stateful tracking of messages and ack/nack coordination to ensure exactly-once delivery.
2133

22-
### At most once delivery
34+
#### At most once delivery
2335
- Message loss is possible.
2436

25-
2637
## Current NATs use cases
38+
39+
## NATS use cases
40+
41+
#### 1. NatsQueue Python binding
42+
- **Location**: `lib/bindings/python/rust/llm/nats.rs` (`NatsQueue`)
43+
- **Functionality**:
44+
- Deprecated: we no longer use the `NatsQueue` Python binding; we use the `NatsQueue` Rust binding instead.
45+
- We can remove the Python binding and the associated tests to simplify the codebase.
46+
2747
#### 2. JetStream-backed Queue/Event Bus
2848
- **Location**: `lib/runtime/src/transports/nats.rs` (`NatsQueue`)
2949
- **Functionality**:
@@ -65,25 +85,63 @@ The goal is to decouple the NATs transport from the pipeline.
6585
- Each service responds once to stats queries
6686

6787
## Proposal
88+
6889
- Use named message bus to publish and subscribe to messages.
69-
- Generalize ingress/egress components to support different transports
90+
- Support different transports (e.g. raw TCP, NATS) for the request/reply pattern.
91+
- Introduce abstractions for each NATS usage (e.g. KV router, JetStream, object store).
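
The proposal above can be sketched as a small transport-agnostic interface. This is an assumption-laden illustration, not the planned API: `MessageBus`, `InMemoryBus`, and `make_bus` are hypothetical names, and the in-memory bus merely stands in for a real transport such as NATS or ZMQ.

```python
from abc import ABC, abstractmethod
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[bytes], None]


class MessageBus(ABC):
    """Hypothetical transport-agnostic interface for a named message bus."""

    @abstractmethod
    def publish(self, subject: str, payload: bytes) -> None: ...

    @abstractmethod
    def subscribe(self, subject: str, handler: Handler) -> None: ...


class InMemoryBus(MessageBus):
    """In-process stand-in for a real transport (NATS, ZMQ, raw TCP)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = defaultdict(list)

    def publish(self, subject: str, payload: bytes) -> None:
        # Deliver synchronously to every subscriber of this subject.
        for handler in list(self._handlers[subject]):
            handler(payload)

    def subscribe(self, subject: str, handler: Handler) -> None:
        self._handlers[subject].append(handler)


def make_bus(kind: str) -> MessageBus:
    """Select an implementation by name, e.g. from configuration."""
    registry = {"memory": InMemoryBus}  # other transports would register here
    return registry[kind]()


bus = make_bus("memory")
received = []
bus.subscribe("kv_events", received.append)
bus.publish("kv_events", b"block-stored")
bus.publish("other", b"ignored")  # no subscriber on this subject
```

Callers depend only on `MessageBus`, so swapping the transport becomes a configuration change rather than a code change.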
92+
93+
### Implementation
94+
- Phase 1
95+
* degraded feature set
96+
* Users can choose not to use the KV router (best effort)
97+
* NATS
98+
* No HA guarantees for router
99+
* Operate without high availability, with a single router
100+
- Phase 2
101+
* explore transports (QUIC, Multicast)
102+
* durability
103+
* exactly once delivery
104+
105+
106+
## Generic Messaging Protocol
107+
Decouple the messaging protocol from the underlying transport, such as raw TCP, ZMQ, HTTP, gRPC, or UCX active messages.
108+
109+
Phased approach: start with ZMQ and NATS, then incrementally expand to more advanced transports while keeping the protocol adaptable to new requirements.
110+
111+
## Handshake and Closure Protocols
112+
Define robust handshake and closure protocols, including sentinels or envelope structures that signal the end of a stream or request.
113+
Common semantics for closing requests and handling errors will be generalized across the different transports.
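
As a rough illustration of sentinel-based closure, the sketch below wraps each item in an envelope and treats `END` and `ERROR` as terminal sentinels. The `Envelope`, `Kind`, `StreamError`, and `read_stream` names are hypothetical; the real envelope layout is not specified in this document.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Iterator


class Kind(Enum):
    DATA = "data"
    END = "end"      # sentinel: stream finished normally
    ERROR = "error"  # sentinel: stream finished with an error


@dataclass(frozen=True)
class Envelope:
    kind: Kind
    body: bytes = b""


class StreamError(Exception):
    pass


def read_stream(envelopes: Iterable[Envelope]) -> Iterator[bytes]:
    """Yield data bodies until END; raise on ERROR or on a missing sentinel."""
    for env in envelopes:
        if env.kind is Kind.DATA:
            yield env.body
        elif env.kind is Kind.END:
            return
        else:
            raise StreamError(env.body.decode("utf-8", errors="replace"))
    # The transport closed without a sentinel, so the stream may be truncated.
    raise StreamError("stream closed without an END sentinel")


ok = list(read_stream([
    Envelope(Kind.DATA, b"a"),
    Envelope(Kind.DATA, b"b"),
    Envelope(Kind.END),
]))
```

Treating a missing sentinel as an error lets the receiver distinguish a cleanly finished stream from a dropped connection, independent of the transport.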
114+
115+
## Multipart Message Structure
116+
Use a multipart message structure, inspired by ZMQ's native multipart support, to encapsulate headers, body, and control signals (such as closure control signals or error notifications).
117+
118+
Extend the existing two-part message structure to support N-part messages, making the protocol more flexible and expressive.
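
For transports without native multipart support, an N-part message could be framed with length prefixes. The wire layout below (a big-endian `u32` part count followed by a `u32` length and the bytes of each part) is an assumption made for illustration and is not the existing two-part format; `encode_parts` and `decode_parts` are hypothetical helpers.

```python
import struct
from typing import List


def encode_parts(parts: List[bytes]) -> bytes:
    """Encode N parts as: u32 count, then (u32 length, bytes) per part."""
    out = [struct.pack(">I", len(parts))]
    for part in parts:
        out.append(struct.pack(">I", len(part)))
        out.append(part)
    return b"".join(out)


def decode_parts(data: bytes) -> List[bytes]:
    """Inverse of encode_parts; rejects truncated or oversized input."""
    (count,) = struct.unpack_from(">I", data, 0)
    offset = 4
    parts = []
    for _ in range(count):
        (length,) = struct.unpack_from(">I", data, offset)
        offset += 4
        if offset + length > len(data):
            raise ValueError("truncated part")
        parts.append(data[offset:offset + length])
        offset += length
    if offset != len(data):
        raise ValueError("trailing bytes after last part")
    return parts


# Header, body, and a control part travel together as one message.
message = [b'{"request_id": "r1"}', b"body bytes", b"END"]
wire = encode_parts(message)
roundtrip = decode_parts(wire)
```

On ZMQ the same parts could map directly onto native multipart frames, while raw TCP would carry the length-prefixed form.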
119+
120+
Review handshake protocols and message flows for the key transports (raw TCP, HTTP SSE, ZMQ, gRPC, UCX) and distill a protocol that works across all of them. Start with simple transports and expand to more complex ones, ensuring the protocol can accommodate future needs and additional transports.
121+
122+
## Python-Rust Interoperability and Data Class Generation
123+
Improve Python-Rust interoperability by auto-generating Python data classes (using Pydantic) from Rust structs and by aligning message schemas, which reduces manual coding and serialization errors.
70124

125+
### Supported transports
126+
- Raw TCP
127+
- ZMQ
128+
- HTTP SSE
129+
- gRPC
130+
- UCX active messaging
131+
- NATS
71132

72-
## Alternative solutions
133+
## Milestones
134+
1. Implement abstractions for each NATS usage
73135

74-
### Message queue transports
75-
- ZeroMQ
76-
- Redis
77-
- Kafka
78-
- SQS
79-
- GCP PubSub
80-
- Azure Service Bus
136+
2. Implement different transports for request/reply pattern
137+
a. Interface for Request Plane
138+
b. sending requests over direct ZMQ
81139

82-
### Object store
83-
- S3
84-
- Redis
85-
- Shared filesystem
140+
3. Implement different transports for KV router
86141

87-
### Ideas
142+
4. Implement different transports for event bus
88143

89-
1. Use RocksDB / local storage to persist messages on producer side to guarantee at least once delivery.
144+
5. Object store:
145+
a. interface for object store
146+
b. object store implementation using shared filesystem
147+
c. object store implementation using model express
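
Milestone 5 (a) and (b) could look roughly like the sketch below: a minimal object store interface with a filesystem-backed implementation. The `ObjectStore` and `FilesystemObjectStore` names and the `put`/`get` methods are hypothetical, and the write-then-rename step is only known to be atomic on a local POSIX filesystem; shared filesystems may behave differently.

```python
import os
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class ObjectStore(ABC):
    """Hypothetical minimal interface for the object store abstraction."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class FilesystemObjectStore(ObjectStore):
    """Stores each object as a file under a root directory."""

    def __init__(self, root: str) -> None:
        Path(root).mkdir(parents=True, exist_ok=True)
        self._root = Path(root).resolve()

    def _path(self, key: str) -> Path:
        path = (self._root / key).resolve()
        # Reject keys such as "../x" that would escape the root directory.
        if self._root not in path.parents:
            raise ValueError(f"invalid key: {key!r}")
        return path

    def put(self, key: str, data: bytes) -> None:
        path = self._path(key)
        path.parent.mkdir(parents=True, exist_ok=True)
        tmp = path.with_name(path.name + ".tmp")
        tmp.write_bytes(data)
        # Rename so readers never observe a partially written object
        # (assumes rename is atomic on the underlying filesystem).
        os.replace(tmp, path)

    def get(self, key: str) -> bytes:
        return self._path(key).read_bytes()


with tempfile.TemporaryDirectory() as root:
    store = FilesystemObjectStore(root)
    store.put("models/card.json", b"{}")
    loaded = store.get("models/card.json")
```

A second implementation (for example one backed by model express) would implement the same two methods, so callers would not need to know which backend is in use.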
