## Overview

The goal is to decouple the NATS transport from the dynamo runtime.
Introduce abstractions for the current NATS usages (e.g. KV router, event plane, request plane, object store) so that different implementations can be plugged in.

## Requirements

- Ability to deliver messages across dynamo instances with an at-least-once delivery guarantee. Serialized bytes must be reliably delivered in both p2p and broadcast modes.
- Ability to switch between transports at runtime (see the sketch below).
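
As an illustration of the runtime-switching requirement, the following is a minimal sketch of selecting a transport from configuration at startup. The `DYN_REQUEST_PLANE` variable and the enum variants are hypothetical placeholders, not an existing dynamo API.

```rust
use std::env;

/// Hypothetical set of request-plane transports; the real list would come
/// from the abstractions introduced in this proposal.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RequestPlaneTransport {
    Nats,
    Zmq,
    RawTcp,
}

impl RequestPlaneTransport {
    /// Pick a transport from an environment variable, defaulting to NATS so
    /// existing deployments keep working.
    fn from_env() -> Self {
        match env::var("DYN_REQUEST_PLANE").as_deref() {
            Ok("zmq") => Self::Zmq,
            Ok("tcp") => Self::RawTcp,
            _ => Self::Nats,
        }
    }
}

fn main() {
    // The chosen transport would be handed to the runtime when it is built.
    let transport = RequestPlaneTransport::from_env();
    println!("request plane transport: {transport:?}");
}
```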

### Event Plane

- KV events need to be delivered with an at-least-once delivery guarantee. KV events are currently handled in an idempotent manner and redelivered messages are ignored.

#### Metrics

### Object Store

### Request Plane

- Need a protocol for carrying request context and control messages such as request cancellation; a rough sketch follows below.
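
One possible shape for such a protocol envelope is sketched here; the message names and fields are assumptions for illustration, not an existing dynamo schema.

```rust
use serde::{Deserialize, Serialize};
use std::collections::HashMap;

/// Illustrative request-plane envelope carrying both data and control messages.
#[derive(Debug, Serialize, Deserialize)]
#[serde(tag = "type", rename_all = "snake_case")]
enum RequestPlaneMessage {
    /// Carries the request context plus the serialized request body.
    Request {
        request_id: String,
        /// Opaque context (tracing ids, routing hints, deadlines, ...).
        context: HashMap<String, String>,
        payload: Vec<u8>,
    },
    /// Control message asking the worker to cancel an in-flight request.
    Cancel { request_id: String },
    /// Sentinel signalling that a streaming response is complete.
    Complete { request_id: String },
}

fn main() -> serde_json::Result<()> {
    let cancel = RequestPlaneMessage::Cancel { request_id: "req-42".into() };
    // Any transport only needs to move these serialized bytes.
    println!("{}", serde_json::to_string(&cancel)?);
    Ok(())
}
```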

### Message Delivery Guarantees

#### At least once delivery (preferred)
- No message loss is possible.
- Each message is delivered at least once to the consumers.
- Consumers should be idempotent and able to handle duplicate messages (see the sketch after this list).
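
A minimal sketch of the consumer-side idempotency this implies, mirroring how KV events are handled today: deduplicate on an event id so redeliveries become no-ops. The id-based scheme below is an assumption for illustration.

```rust
use std::collections::HashSet;

/// Tracks already-applied event ids so redelivered messages become no-ops.
/// A real consumer would bound or persist this state; this sketch keeps it
/// in memory for clarity.
struct IdempotentConsumer {
    seen: HashSet<u64>,
}

impl IdempotentConsumer {
    fn new() -> Self {
        Self { seen: HashSet::new() }
    }

    /// Returns true if the event was applied, false if it was a duplicate.
    fn handle(&mut self, event_id: u64, payload: &[u8]) -> bool {
        if !self.seen.insert(event_id) {
            return false; // duplicate delivery, ignore
        }
        // Apply the event here (e.g. update the KV index).
        let _ = payload;
        true
    }
}

fn main() {
    let mut consumer = IdempotentConsumer::new();
    assert!(consumer.handle(1, b"kv event"));
    // At-least-once delivery may hand us the same event again.
    assert!(!consumer.handle(1, b"kv event"));
}
```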

#### Exactly once delivery
- Needs stateful tracking of messages and ack/nack coordination to ensure exactly-once delivery.

#### At most once delivery
- Message loss is possible.

## Current NATS use cases

#### 1. NatsQueue python binding
- **Location**: `lib/bindings/python/rust/llm/nats.rs` (`NatsQueue`)
- **Functionality**: deprecated. We no longer use the `NatsQueue` Python binding; we use the `NatsQueue` Rust binding instead.
- We can remove the Python binding and the associated tests to simplify the codebase.

#### 2. JetStream-backed Queue/Event Bus
- **Location**: `lib/runtime/src/transports/nats.rs` (`NatsQueue`)
- **Functionality**:
[...]
- Each service responds once to stats queries

## Proposal

- Use a named message bus to publish and subscribe to messages.
- Support different transports (e.g. raw TCP, NATS) for the request/reply pattern.
- Introduce abstractions for each NATS usage (e.g. KV router, JetStream, object store, etc.); a minimal sketch follows below.
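
A rough sketch of what such abstractions could look like, assuming simple trait objects over serialized bytes; the trait and method names are placeholders, not the existing runtime API.

```rust
/// Placeholder error type for the sketch.
type TransportError = Box<dyn std::error::Error + Send + Sync>;

/// Broadcast side (e.g. KV events). Implementations could be backed by
/// NATS core, JetStream, ZMQ pub/sub, etc.
#[allow(dead_code)]
trait EventPlane {
    fn publish(&self, subject: &str, payload: &[u8]) -> Result<(), TransportError>;
    fn subscribe(
        &self,
        subject: &str,
        handler: Box<dyn Fn(&[u8]) + Send + Sync>,
    ) -> Result<(), TransportError>;
}

/// Point-to-point request/reply side (the request plane).
#[allow(dead_code)]
trait RequestPlane {
    fn request(&self, target: &str, payload: &[u8]) -> Result<Vec<u8>, TransportError>;
}

fn main() {
    // The runtime would hold `Box<dyn EventPlane>` / `Box<dyn RequestPlane>`
    // and pick the concrete implementations at startup.
    println!("transport abstraction sketch");
}
```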

### Implementation

- Phase 1
  * Degraded feature set.
  * Users may choose not to use the KV router; routing is best effort.
  * NATS as the initial transport.
  * No HA guarantees for the router.
  * Operate without high availability, with a single router.
- Phase 2
  * Explore additional transports (QUIC, multicast).
  * Durability.
  * Exactly-once delivery.

## Generic Messaging Protocol

Decouple the messaging protocol from the underlying transport, whether raw TCP, ZMQ, HTTP, gRPC, or UCX active messages.

Phased approach: start with ZMQ and NATS, then incrementally expand to more advanced transports, ensuring that the protocol remains adaptable to requirements.

## Handshake and Closure Protocols

Define robust handshake and closure protocols, including the use of sentinels or envelope structures to signal the end of streams or requests.
A common semantic for closing requests and handling errors will be generalized across the different transports.

## Multipart Message Structure

Use a multipart message structure, inspired by ZMQ's native multipart support, to encapsulate headers, body, and control signals (such as closure control signals or error notifications).

Extend the existing two-part message structure to support N-part messages, making the protocol more flexible and expressive. A possible frame layout is sketched below.
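
The following sketch shows one way an N-part message could be represented independently of the transport: ZMQ would map parts to native frames, while raw TCP or NATS would length-prefix them. Struct and field names are assumptions for illustration.

```rust
/// One logical message made of N parts, in the spirit of ZMQ multipart
/// messages: a header part, zero or more body parts, and a trailing
/// control frame (closure sentinel or error notification).
struct MultipartMessage {
    parts: Vec<Vec<u8>>,
}

#[derive(Debug)]
#[allow(dead_code)]
enum ControlFrame {
    /// Sentinel marking the end of a stream or request.
    EndOfStream,
    /// Error notification with a human-readable reason.
    Error(String),
}

impl MultipartMessage {
    /// Assemble header + body parts + trailing control frame.
    fn with_control(header: Vec<u8>, body: Vec<Vec<u8>>, control: ControlFrame) -> Self {
        let mut parts = vec![header];
        parts.extend(body);
        // Encode the control frame as the final part (encoding is illustrative).
        parts.push(format!("{control:?}").into_bytes());
        Self { parts }
    }
}

fn main() {
    let msg = MultipartMessage::with_control(
        br#"{"request_id":"req-42"}"#.to_vec(),
        vec![b"chunk-1".to_vec(), b"chunk-2".to_vec()],
        ControlFrame::EndOfStream,
    );
    // 4 parts: header, two body chunks, and the closure control frame.
    println!("parts: {}", msg.parts.len());
}
```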

Review the handshake protocols and message flows of the key transports (raw TCP, HTTP SSE, ZMQ, gRPC, UCX) and distill a protocol that works across all of them. Start with the simple transports and expand to the more complex ones, so that the protocol can accommodate future needs and additional transports.

## Python-Rust Interoperability and Data Class Generation

Improve Python-Rust interoperability by auto-generating Python data classes from Rust structs (e.g. via Pydantic) and by aligning message schemas, reducing manual coding and serialization errors.
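
For example, a Rust message struct could serve as the single source of truth, with the matching Pydantic class generated from its schema rather than hand-written. The struct below is purely illustrative; the real schemas would come from the runtime's message types.

```rust
use serde::{Deserialize, Serialize};

/// Illustrative event schema defined once in Rust. A JSON Schema emitted from
/// this type could drive generation of the Pydantic data class on the Python
/// side, keeping both languages in sync.
#[derive(Debug, Serialize, Deserialize)]
struct KvCacheEvent {
    event_id: u64,
    worker_id: String,
    block_hashes: Vec<u64>,
}

fn main() -> serde_json::Result<()> {
    let event = KvCacheEvent {
        event_id: 7,
        worker_id: "worker-0".into(),
        block_hashes: vec![0xdead, 0xbeef],
    };
    println!("{}", serde_json::to_string_pretty(&event)?);
    Ok(())
}
```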

### Supported transports

- Raw TCP
- ZMQ
- HTTP SSE
- gRPC
- UCX active messaging
- NATS

## Milestones

1. Implement abstractions for each NATS usage.

2. Implement different transports for the request/reply pattern:
   a. Interface for the request plane.
   b. Sending requests over direct ZMQ.

3. Implement different transports for the KV router.

4. Implement different transports for the event bus.

5. Object store (see the interface sketch after this list):
   a. Interface for the object store.
   b. Object store implementation using a shared filesystem.
   c. Object store implementation using Model Express.
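
A possible shape for the milestone 5a interface: get/put over opaque bytes, so a shared-filesystem backend and a Model Express backend can both implement it. All names below are placeholders.

```rust
use std::collections::HashMap;
use std::io;

/// Minimal object-store abstraction assumed for this sketch: opaque bytes
/// addressed by a string key.
trait ObjectStore {
    fn put(&mut self, key: &str, data: Vec<u8>) -> io::Result<()>;
    fn get(&self, key: &str) -> io::Result<Option<Vec<u8>>>;
}

/// Toy in-memory backend used only to show the trait being exercised; real
/// backends would be the shared filesystem or Model Express implementations.
#[derive(Default)]
struct InMemoryStore {
    objects: HashMap<String, Vec<u8>>,
}

impl ObjectStore for InMemoryStore {
    fn put(&mut self, key: &str, data: Vec<u8>) -> io::Result<()> {
        self.objects.insert(key.to_string(), data);
        Ok(())
    }

    fn get(&self, key: &str) -> io::Result<Option<Vec<u8>>> {
        Ok(self.objects.get(key).cloned())
    }
}

fn main() -> io::Result<()> {
    let mut store = InMemoryStore::default();
    store.put("models/llama/config.json", b"{}".to_vec())?;
    assert!(store.get("models/llama/config.json")?.is_some());
    Ok(())
}
```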