A single-threaded async HTTP + WebSocket server on Linux io_uring, in Zig 0.16.0.
IO thread (io_uring Ring A + fiber):
├── accept/read/write CQE → fiber → handler → respond
├── drain user SubmitQueues
├── drain Next.go() ringbuffer tasks
├── drain DeferredResponse / InvokeQueue → respond
├── drainTick (DNS tick + invoke.drain + tick_hooks)
├── FiberShared.tick() — harvest outbound rings (Ring B/C/D...)
└── TTL incremental scan (StackPool live list)
Worker pool (optional, offload CPU/GPU/blocking I/O):
└── Next.submit() → worker thread → compute → InvokeQueue → IO thread drains
Handlers run as fibers on the IO thread by default.
- Next.go(): fiber on the IO thread, zero thread switch. Use for DB io_uring and other async I/O.
- Next.submit(): worker pool. Use only for CPU-intensive computation that would block the IO thread.
- Linux 5.1+ (io_uring)
- Zig 0.16.0
git clone https://github.com/fndome/sws
cd sws
zig build run

const sws = @import("sws");
pub fn main() !void {
    // alloc and io are set up by the caller; myHandler is a handler function defined elsewhere.
    var server = try sws.AsyncServer.init(alloc, io, "0.0.0.0:9090", null, 64);
    defer server.deinit();
    server.GET("/hello", myHandler);
    try server.run();
}

The entire event loop runs on one IO thread. Handlers execute as fibers (user-space coroutines) on the same thread.
IO thread (single):
io_uring.submit_and_wait(1)
→ CQE dispatch (via StackPool sticker)
→ fiber → handler → ctx.text/json/html
→ drainPendingResumes (fiber resume queue)
→ drainNextTasks (Next.go ringbuffer tasks)
→ drainTick (DNS tick + invoke.drain + tick_hooks)
→ FiberShared.tick() (outbound Ring B/C/D harvest)
→ TTL scan (StackPool live list, incremental)
→ loop
No background threads unless you call server.initPool4NextSubmit(n).
Connections are stored in a pre-allocated array (not a hash map). O(1) acquire/release via freelist.
StackPool<StackSlot, 1_048_576>
├── slots: [1M]StackSlot — contiguous, cache-line-aligned
├── freelist: [1M]u32 — O(1) pop/push
├── live: []u32 — active slot indices (TTL scan source)
└── warmup() — touch all pages to eliminate cold-start faults
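The freelist idea behind the O(1) acquire/release claim can be sketched as follows. This is a minimal illustration, not the StackPool source: the name `SlotPool`, its methods, and the choice to hand out index 0 first are all assumptions, and the `live` list and `warmup()` are left out.

```zig
const std = @import("std");

/// Fixed-capacity slot pool with an index freelist (hypothetical sketch).
fn SlotPool(comptime T: type, comptime capacity: u32) type {
    return struct {
        const Self = @This();

        slots: [capacity]T = undefined, // contiguous storage, never reallocated
        freelist: [capacity]u32 = undefined, // stack of free slot indices
        free_count: u32 = 0,

        /// Fill the freelist so that index 0 is handed out first.
        pub fn init(self: *Self) void {
            var i: u32 = 0;
            while (i < capacity) : (i += 1) {
                self.freelist[i] = capacity - 1 - i;
            }
            self.free_count = capacity;
        }

        /// O(1): pop an index off the freelist, or null when every slot is in use.
        pub fn acquire(self: *Self) ?u32 {
            if (self.free_count == 0) return null;
            self.free_count -= 1;
            return self.freelist[self.free_count];
        }

        /// O(1): push the index back so the slot can be reused.
        pub fn release(self: *Self, idx: u32) void {
            std.debug.assert(self.free_count < capacity);
            self.freelist[self.free_count] = idx;
            self.free_count += 1;
        }

        /// Direct array lookup by index; no hashing involved.
        pub fn get(self: *Self, idx: u32) *T {
            return &self.slots[idx];
        }
    };
}
```

Because both the slots and the freelist are plain arrays sized at compile time, neither acquire nor release allocates or searches.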
Each connection slot is split across independent cache lines for contention-free hot-path access:
line1 ( 64B): fd, gen_id, state, write_offset, req_count — CQE dispatch (hottest)
line2 ( 64B): conn_id, last_active_ms, active_list_pos — TTL scanning
line3 ( 64B): fiber_context, large_buf_ptr — async anchors, Worker Pool, oversized body
line4 (128B): response_buf, write_iovs, ws_write_queue — write path (low frequency)
line5 ( 64B): sentinel (0x53574153) + workspace union — HTTP/WS/Compute view
Ghost event defense: user_data = (gen_id << 32) | idx. After close, gen_id is zeroed. Any in-flight CQE arriving after close fails the gen_id match and is silently discarded.
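The packing and the stale-CQE check described above might look like the sketch below. The function names are hypothetical, and the sketch assumes that live slots always carry a nonzero gen_id, so a zeroed slot can never match.

```zig
const std = @import("std");

/// Pack a generation counter and a slot index into one 64-bit user_data value.
fn packUserData(gen_id: u32, idx: u32) u64 {
    return (@as(u64, gen_id) << 32) | @as(u64, idx);
}

/// Upper 32 bits: the generation the SQE was submitted under.
fn unpackGen(user_data: u64) u32 {
    return @truncate(user_data >> 32);
}

/// Lower 32 bits: the slot index.
fn unpackIdx(user_data: u64) u32 {
    return @truncate(user_data);
}

/// A CQE is accepted only if its generation matches the slot's current one.
/// Assumption: a closed slot has gen_id == 0 and live slots never use 0.
fn isLive(slot_gen: u32, user_data: u64) bool {
    return slot_gen != 0 and slot_gen == unpackGen(user_data);
}
```

A completion submitted under generation 7 is dropped both when the slot has been closed (gen_id 0) and when it has since been reused under a different generation.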
Workspace switching: The line5.ws union switches between HttpWork, WsWork, and ComputeWork views depending on connection state — no heap allocation for protocol parsing state.
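As an illustration of view switching without heap allocation, here is a sketch using a Zig tagged union. The field names inside each view are invented for the example, and the real line5.ws may well be an untagged union keyed by the connection state; only the three view names come from the text above.

```zig
const std = @import("std");

// Hypothetical per-protocol scratch state; the real fields are not documented here.
const HttpWork = struct { header_end: u32 = 0, content_length: u32 = 0 };
const WsWork = struct { opcode: u8 = 0, payload_len: u32 = 0 };
const ComputeWork = struct { task_id: u64 = 0 };

/// One storage area, three views. Only one view is active at a time.
const Workspace = union(enum) {
    http: HttpWork,
    ws: WsWork,
    compute: ComputeWork,
};

comptime {
    // The whole union fits in one 64-byte cache line.
    std.debug.assert(@sizeOf(Workspace) <= 64);
}

/// Switch the view in place, e.g. after an HTTP Upgrade handshake.
fn upgradeToWs(w: *Workspace) void {
    w.* = .{ .ws = .{} };
}
```

Switching views overwrites the same bytes, so the parser state for the previous protocol is gone after the switch.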
Ring A (built-in): the main server's io_uring ring — accept, connection read/write, DNS, invoke.
Outbound rings (Ring B, Ring C...): independent io_uring rings for outbound protocols. Registered with FiberShared, which the IO thread harvests every event loop iteration — zero extra threads, zero locks.
Ring A (main server, IO thread):
├── accept / read / write / close
├── io_registry (client callbacks)
├── dns_resolver (async UDP DNS)
├── rs.invoke (cross-thread push → IO thread callback)
└── FiberShared.tick()
├── Ring B (HTTP client) : ATTACH_WQ → shared io-wq
│ ├── DnsResolver
│ ├── IORegistry
│ ├── InvokeQueue
│ └── TinyCache
├── Ring C (NATS client futures)
└── Ring D (MySQL client futures)
FiberShared is a pure scheduling glue layer. It holds tick handles for all outbound rings. The IO thread calls tick() every loop iteration to harvest CQEs from all registered rings without blocking.
const FiberShared = @import("sws").FiberShared;
const RingTrait = @import("sws").RingTrait;
var fiber_shared = try FiberShared.init(allocator);
defer fiber_shared.deinit();
// Ring B (HTTP client) registers itself
try ring_b.registerWith(&fiber_shared);
// Any new ring just implements RingTrait:
// ptr: *anyopaque — pointer to ring instance
// tickFn: fn(*anyopaque) void - dns.tick + invoke.drain + submit + copy_cqes + dispatch

Contract that every outbound ring must implement:
- dns.tick(): drive the DNS query state machine
- invoke.drain(): process cross-thread task dispatch
- ring.submit(): submit pending SQEs
- ring.copy_cqes(): harvest CQEs without blocking
- registry.dispatch(ud, res): dispatch CQEs to registered callbacks
var server = try AsyncServer.init(alloc, io, "0.0.0.0:9090", app_ctx, fiber_stack_size_kb);
// ↑ 0 = 64KB

The first handler/middleware registration calls ensureNext() → creates Next (ringbuffer) + setDefault().
Internally, AsyncServer.init() creates:
- pool: StackPool - O(1) contiguous connection array
- large_pool: LargeBufferPool(64) - 64 × 1MB blocks for oversized requests (>32KB)
- rs: RingShared - single ring shared resource (ring + registry + invoke)
- io_registry: IORegistry - outbound client connection registry
- dns_resolver: DnsResolver - async UDP DNS with TTL cache
To add outbound rings (HTTP/NATS/MySQL), inject FiberShared:
const FiberShared = @import("sws").FiberShared;
const RingB = @import("sws").HttpRing; // HttpRing is Ring B
const HttpClient = @import("sws").HttpClient;
var fiber_shared = try FiberShared.init(alloc);
defer fiber_shared.deinit();
// Ring B with 1s built-in TinyCache TTL:
var ring_b = try RingB.init(alloc, io, server.ring.fd, 1000);
defer ring_b.deinit();
try ring_b.registerWith(&fiber_shared);
// HttpClient auto-uses RingB's built-in cache:
var http_client = try HttpClient.init(alloc, &ring_b);
defer http_client.deinit();
// Set fiber_shared on server so IO thread harvests outbound CQEs
server.fiber_shared = &fiber_shared;

fn hello(allocator: Allocator, ctx: *Context) anyerror!void {
    _ = allocator; // unused in this handler
    ctx.text(200, "hello");
}

For async I/O (DB io_uring, HTTP client):
const Ctx = struct { allocator: Allocator, resp: *DeferredResponse };
fn exec(c: *Ctx, complete: *const fn (?*anyopaque, []const u8) void) void {
    // defers run in reverse order: c.resp is destroyed first, then c
    defer c.allocator.destroy(c);
    defer c.allocator.destroy(c.resp);
    c.resp.json(200, "[{\"id\":1}]");
    complete(c, "");
}
fn myHandler(allocator: Allocator, ctx: *Context) anyerror!void {
    const s: *AsyncServer = @ptrCast(@alignCast(ctx.server.?));
    const resp = try allocator.create(DeferredResponse);
    resp.* = .{ .server = s, .conn_id = ctx.conn_id, .allocator = allocator };
    ctx.deferred = true;
    Next.go(Ctx, .{ .allocator = allocator, .resp = resp }, exec);
}

For offload work (crypto, compression, LLM/GPU inference, blocking I/O):
const Ctx = struct { allocator: Allocator, resp: *DeferredResponse };
fn exec(c: *Ctx, complete: *const fn (?*anyopaque, []const u8) void) void {
    defer c.allocator.destroy(c);
    defer c.allocator.destroy(c.resp);
    // Offload work here (CPU/GPU/blocking I/O)...
    c.resp.json(200, "{\"done\": true}");
    complete(c, "");
}
fn myHandler(allocator: Allocator, ctx: *Context) anyerror!void {
    const s: *AsyncServer = @ptrCast(@alignCast(ctx.server.?));
    const resp = try allocator.create(DeferredResponse);
    resp.* = .{ .server = s, .conn_id = ctx.conn_id, .allocator = allocator };
    ctx.deferred = true;
    Next.submit(Ctx, .{ .allocator = allocator, .resp = resp }, exec);
}

try server.initPool4NextSubmit(1); // 1 worker thread (recommended)

Recommendations:
- 1: default, sufficient for crypto and compression
- N/2 (e.g. 4 on 8-core): sustained LLM/GPU inference or blocking I/O
DeferredResponse sends an HTTP response from any thread (CAS-based, lock-free):
resp.json(200, "{\"ok\":true}");
resp.text(200, "plain");

Deferred hooks execute custom logic before each deferred response is sent, on the IO thread. Essential for MMORPG / real-time use cases (update game state, leaderboard, broadcast):
fn updateGameState(server: *AsyncServer, node: *DeferredNode) void {
    const world: *GameWorld = @ptrCast(@alignCast(server.app_ctx.?));
    world.update(node.body);
}
try server.addHookDeferred(updateGameState);

Rules:
- Hooks run in registration order on the IO thread, so they are safe for IO-thread-exclusive data
- node.body is valid during hook execution; do NOT free it
- Do NOT store the node pointer; the node is destroyed after the hook returns
- Must not panic (log errors instead)
Rooms with countdown → auto-battle for hundreds of players. Two hooks cooperate:
addHookTick checks deadlines every loop iteration (no deferred node needed);
addHookDeferred processes incoming player commands.
Battle CPU work offloaded via Next.submit. Zero locks — all state on IO thread.
const Room = struct {
    id: u64,
    state: enum { waiting, fighting, settle },
    deadline: i64, // monotonic timestamp
    teams: [2]std.ArrayList(*Player),
};
const Player = struct { id: u64, hp: u32, atk: u32 };
const BattleCtx = struct {
    // server and room_id are carried along so doBattle can respond
    // (added for this example; doBattle has no other way to reach them)
    server: *AsyncServer,
    room_id: u64,
    blue_team: []PlayerSnapshot,
    red_team: []PlayerSnapshot,
};
const PlayerSnapshot = struct { hp: u32, atk: u32 };

fn roomTick(server: *AsyncServer) void {
    const app: *GameApp = @ptrCast(@alignCast(server.app_ctx.?));
    for (app.rooms.items) |*room| {
        if (room.state == .waiting and server.monotonic_ms() >= room.deadline) {
            room.state = .fighting;
            startBattle(server, room);
        }
    }
}
fn roomCommand(server: *AsyncServer, node: *DeferredNode) void {
    const app: *GameApp = @ptrCast(@alignCast(server.app_ctx.?));
    app.processCommand(node.body); // join / ready / action
}
fn startBattle(server: *AsyncServer, room: *Room) void {
    const ctx = server.allocator.create(BattleCtx) catch return;
    ctx.server = server;
    ctx.room_id = room.id;
    ctx.blue_team = snapshotTeam(&room.teams[0], server.allocator) catch return;
    ctx.red_team = snapshotTeam(&room.teams[1], server.allocator) catch return;
    Next.submit(BattleCtx, ctx, doBattle);
}
fn doBattle(ctx: *BattleCtx, complete: *const fn (?*anyopaque, []const u8) void) void {
    // Runs on a worker thread: pure CPU work on the snapshots
    const result = simulateCombat(ctx.blue_team, ctx.red_team);
    var buf: [4096]u8 = undefined;
    const json = result.toJson(&buf);
    ctx.server.sendDeferredResponse(ctx.room_id, 200, .json, json);
    _ = complete;
}
try server.addHookTick(roomTick); // tick: fires every IO loop
try server.addHookDeferred(roomCommand); // deferred: fires per-player command

Next.go(Ctx, ctx, exec); // fiber on IO thread (io_uring I/O)
Next.submit(Ctx, ctx, exec); // worker pool (offload work)

Both are static. Next.go works out of the box (auto setDefault on first route). Next.submit requires server.initPool4NextSubmit(n).
GPU compute uses Next.submit — worker thread calls CUDA / CANN / Vulkan runtime.
io_uring direct dispatch for GPU is blocked on Linux kernel drivers (missing
IORING_OP_URING_CMD for compute queues, NVIDIA / Huawei not yet shipped).
Once drivers add it, IORegistry handles GPU with zero code changes —
same register(id, ptr, on_cqe) → submit SQE → dispatch CQE pattern.
Current: fiber + worker pool
Worker pool always supports fiber. GPU task calls Fiber.workerYield(poll, ctx)
after submitting a kernel, freeing the worker thread to process other tasks while
the GPU runs. The worker tick polls parked fibers and resumes when the kernel completes.
// CPU task: no yield, runs to completion
Next.submit(CpuCtx, ctx, struct {
    fn exec(c: *CpuCtx, complete: *const fn (?*anyopaque, []const u8) void) void {
        const result = heavyCompute(c.input);
        complete(c, result);
    }
}.exec);
// GPU task: MUST call workerYield after submitting kernel
// ↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓
Next.submit(GpuCtx, ctx, struct {
    fn exec(c: *GpuCtx, complete: *const fn (?*anyopaque, []const u8) void) void {
        // kernel, stream, args and output are placeholders for the caller's CUDA state
        cudaLaunchKernel(kernel, stream, args);
        Fiber.workerYield( // ← THIS LINE makes it a GPU task
            struct { fn poll(s: *anyopaque) bool {
                return cuStreamQuery(@ptrCast(@alignCast(s))) == CUDA_SUCCESS;
            }}.poll,
            @ptrCast(stream),
        );
        // resume point: GPU done
        complete(c, output);
    }
}.exec);

The only difference between CPU and GPU: GPU tasks call Fiber.workerYield.
Without it, the worker thread blocks synchronously until the kernel completes,
defeating fiber multiplexing.
⚠️ GPU tasks MUST use Next.submit, never Next.go.
Next.go runs on the IO thread. Two failure modes:
- Without workerYield: cuStreamSynchronize blocks the IO thread; io_uring CQE processing stops and the entire server freezes.
- With workerYield: the fiber yields correctly and the IO thread stays alive, but the fiber never wakes up. The IO thread has no poll tick; it only responds to io_uring CQEs. GPU kernels don't produce CQEs, so the IO thread never learns the kernel finished.

Worker threads have a built-in poll tick (while poll_fn() try resume), which is why GPU works there: workerYield → park → tick → poll → resume.
IMPORTANT: GPU uses initPool4NextSubmit(1).
GPU drivers are async internally — one worker + fiber can submit N streams
and poll for completion. No extra thread pool needed. io_uring not yet
supported for GPU compute (kernel driver gap).
RingShared is the materialization of a single io_uring ring + single thread — injected into server and any outbound client, all equal.
const rs = server.rs; // { ring, registry, invoke, io_tid }
// Any client is injected equally:
var client = try RingSharedClient.init(alloc, rs, ...);
var http = try HttpClient.init(alloc, ring_b, cache);

- rs.ringPtr() / rs.registryPtr(): IO-thread assertion guard (non-IO thread access → @panic)
- rs.invoke.push(): any-thread-safe CAS callback (worker → IO thread)
RingSharedClient is an io_uring-driven outbound TCP client. It is the glue layer for integrating NATS / Redis / HTTP client libraries into sws's IO thread: no separate runtime, no locks.
const RingSharedClient = @import("sws").RingSharedClient;
fn onData(ctx: ?*anyopaque, data: []u8) void {
    const nats: *NatsClient = @ptrCast(@alignCast(ctx));
    nats.feed(data);
}
fn onClose(ctx: ?*anyopaque) void {
    const nats: *NatsClient = @ptrCast(@alignCast(ctx));
    nats.discard();
}
// In main(), before server.run():
var cs = try RingSharedClient.init(allocator, server.rs, onData, onClose, nats_ctx);
defer cs.deinit();
try cs.connect("127.0.0.1", 4222);
// Send data (queued, submitted via io_uring)
try cs.write("PUB subject 5\r\nhello\r\n");
cs.close(); // graceful

- All I/O runs on the sws IO thread; onData / onClose run in the same context as hooks
- write() queues data; pending writes are auto-flushed as io_uring CQEs arrive
- The protocol layer (NATS / Redis / HTTP) only needs feed([]u8) and write([]const u8)
- Multiple clients per server; user_data uses a dedicated high bit to avoid collisions
TinyCache is a single-entry TTL connection cache for outbound protocols. It is owned by RingB, and all
lifecycle (init, tick, evict, deinit) is managed automatically. Users get connection
reuse for free with HttpClient.
- Same host:port connections are auto-reused within the TTL window
- Expired entries are auto-evicted by RingB.tick() each event loop iteration
- The connect phase allows retries; the read/write phase forbids retries (kernel TCP stack guarantees SQE-level writes)
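To make the reuse and eviction rules concrete, here is a minimal single-entry TTL cache. It is a sketch under assumptions, not the TinyCache source: the struct name, the 64-byte key buffer, the `put`/`take`/`tick` methods and the millisecond clock passed in by the caller are all hypothetical.

```zig
const std = @import("std");

/// Single-entry TTL cache keyed by "host:port" (hypothetical sketch).
const SingleEntryCache = struct {
    key_buf: [64]u8 = undefined,
    key_len: usize = 0,
    fd: i32 = -1, // -1 means the entry is empty
    expires_ms: i64 = 0,
    ttl_ms: i64,

    /// Store an idle connection; any previous entry is overwritten.
    pub fn put(self: *SingleEntryCache, key: []const u8, fd: i32, now_ms: i64) void {
        if (key.len > self.key_buf.len) return; // key too long to cache
        @memcpy(self.key_buf[0..key.len], key);
        self.key_len = key.len;
        self.fd = fd;
        self.expires_ms = now_ms + self.ttl_ms;
    }

    /// Return the cached fd on a key match inside the TTL window and empty the entry.
    pub fn take(self: *SingleEntryCache, key: []const u8, now_ms: i64) ?i32 {
        if (self.fd < 0 or now_ms >= self.expires_ms) return null;
        if (!std.mem.eql(u8, self.key_buf[0..self.key_len], key)) return null;
        const fd = self.fd;
        self.fd = -1;
        return fd;
    }

    /// Called once per loop iteration; returns an expired fd for the caller to close.
    pub fn tick(self: *SingleEntryCache, now_ms: i64) ?i32 {
        if (self.fd >= 0 and now_ms >= self.expires_ms) {
            const fd = self.fd;
            self.fd = -1;
            return fd;
        }
        return null;
    }
};
```

With a 1000 ms TTL, a connection stored at t=0 can be taken again until just before t=1000; after that, `take` misses and the next `tick` hands the fd back for closing.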
Pipe adapts RingSharedClient's push model to a pull model (reader.read / writer.write).
Enables synchronous-protocol libraries (pgz, myzql) to run directly on the IO thread
via fiber yield/resume — no worker threads, no locks.
// In main(), after AsyncServer.init() and before server.run():
const Pipe = @import("sws").Pipe;
const RingSharedClient = @import("sws").RingSharedClient;
fn onData(ctx: ?*anyopaque, data: []u8) void {
    const p: *Pipe = @ptrCast(@alignCast(ctx));
    p.feed(data) catch {};
}
fn onClose(ctx: ?*anyopaque) void {
    const p: *Pipe = @ptrCast(@alignCast(ctx));
    p.reset();
}
// pipe is declared first so &pipe can be handed to the client; it is initialized right after
var pipe: Pipe = undefined;
var cs = try RingSharedClient.init(allocator, server.rs, onData, onClose, &pipe);
pipe = try Pipe.init(allocator, cs);
defer pipe.deinit();
try cs.connect("localhost", 5432);
// ... wait for connect (yield) ...
// Any protocol lib with anytype reader/writer works:
// var conn = try pgz.Connection.init(allocator, pipe.reader(), pipe.writer());
// var result = try conn.query("SELECT 1", struct { u8 });

- feed(data) pushes bytes from ClientStream → read buffer and resumes the waiting fiber
- reader.read() blocks the fiber (via yield) until data arrives; it looks synchronous to the caller
- writer.write() queues into a buffer; flushWrite() sends via ClientStream
- reset() clears buffers on disconnect/reconnect
- Requires the protocol library to accept anytype reader/writer (pgz needs a 1-line patch on WriteBuffer.send)
LargeBufferPool handles oversized requests (Content-Length > 32KB) that can't fit in the 64KB shared fiber stack. It holds pre-allocated 1MB blocks with O(1) freelist acquire/release.
const LargeBufferPool = @import("sws").LargeBufferPool;
// 64 blocks × 1MB = 64MB — built into AsyncServer by default
// Usage in oversized body path:
const buf = self.large_pool.acquire() orelse return error.OutOfLargeBuffers;
// io_uring READ CQE writes directly to buf.ptr
// ... process body ...
self.large_pool.release(buf);

Independent io_uring Ring B for outbound HTTP client. Shares the kernel io-wq thread pool
via IORING_SETUP_ATTACH_WQ. TinyCache is built into RingB, so same host:port connections
are automatically reused within the TTL window and evicted by RingB.tick().
const sws = @import("sws");
// Ring B init (attached to server's Ring A io-wq, 1s cache TTL):
var ring_b = try sws.HttpRing.init(allocator, io, server.ring.fd, 1000);
defer ring_b.deinit();
try ring_b.registerWith(&fiber_shared);
// HttpClient — cache is automatically managed by RingB:
var http_client = try sws.HttpClient.init(allocator, &ring_b);
defer http_client.deinit();
// Use from handler:
const resp = try http_client.get("http://api.example.com/data");
defer resp.deinit();
// POST with body:
const resp2 = try http_client.post("http://api.example.com/submit", "{\"key\":\"val\"}");

Built-in DnsResolver covers basic needs (A record + TTL cache). For truncated UDP (TC bit → TCP retry) or SRV records, switch to c-ares:
sudo apt install libc-ares-dev

Add to build.zig:
exe.linkSystemLibrary("cares");

Switch DNS backend:
const HttpCaresDns = sws.HttpCaresDns;
// ring.dns = HttpCaresDns.init(alloc, ring.rs);

Built-in fiber (x86_64 and ARM64 Linux). All handler fibers share a single pre-allocated stack buffer (stored in AsyncServer.shared_fiber_stack): sequential execution, no per-request stack allocation, zero contention.
⚠️ Do NOT use std.Io.async() / future.await() in handlers.
Zig's Future is a thread-based design, not fiber-based:
- async() → std.Thread.spawn + queued to OS thread pool (Threaded.zig:2112)
- await() → Thread.futexWait, which blocks the OS thread (Threaded.zig:2436)

On the IO thread, blocking means:
- io_uring CQE processing stops: no new connections, no reads, no writes
- The entire server stalls for the duration of the work

future.await() requires the caller's stack frame to persist across suspension:
var future = io.async(work, .{data});
const result = future.await(io); // fiber yields here: stack must survive
ctx.json(200, result); // resumes here: expects data still intact

SWS uses a shared stack (one 64KB buffer, all fibers reuse it). When a fiber yields in await(), the next fiber's execution overwrites that same memory. The resumed fiber's stack frame is corrupted.

Switching to per-fiber stacks would fix this, but at a steep memory cost:
| Concurrent requests | Per-fiber stack | Shared stack |
|---|---|---|
| 1K | 16 MB | 64 KB |
| 20K | 320 MB | 64 KB |
| 200K | 3.2 GB | 64 KB |
| 1M | 16 GB | 64 KB |

(per-fiber stack at 16KB, the practical minimum for HTTP handlers)

At a typical production load of 200K concurrent requests, shared stack saves ~3GB. This directly translates to lower memory pressure and better operational stability.
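The figures in the table follow from one multiplication per row. The sketch below reproduces them using decimal units (1 MB = 1000 KB), which is what the table's rounding implies; the constant and function names are made up for the example.

```zig
const std = @import("std");

const per_fiber_stack_kb: u64 = 16; // per-fiber stack size assumed by the table
const shared_stack_kb: u64 = 64; // one shared stack, independent of load

/// Total stack memory in KB if every concurrent request owned its own stack.
fn perFiberTotalKb(concurrent: u64) u64 {
    return concurrent * per_fiber_stack_kb;
}

pub fn main() void {
    const loads = [_]u64{ 1_000, 20_000, 200_000, 1_000_000 };
    for (loads) |n| {
        const kb = perFiberTotalKb(n);
        std.debug.print(
            "{d} requests: per-fiber {d} MB, shared {d} KB\n",
            .{ n, kb / 1000, shared_stack_kb },
        );
    }
}
```

At 200K requests the per-fiber total is 3,200,000 KB (3.2 GB) against a constant 64 KB, which is where the "~3GB saved" figure comes from.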
This is the fundamental tradeoff: Future API semantics vs. 1M-connection memory model. SWS chooses the latter. All async is done via Next.go / Next.submit with callbacks instead of await-style suspension.
- Fibers are cooperative; OS threads are preemptive. This breaks the fiber model.

| Zig pattern | SWS replacement |
|---|---|
| io.async(cpuWork) + future.await(io) | Next.submit(Ctx, ctx, exec) + DeferredResponse |
| io.async(ioWork) + future.await(io) | Next.go(Ctx, ctx, exec) (fiber on IO thread) |

Pattern:
// ❌ Don't do this in handler (blocks IO thread):
// var future = io.async(heavyWork, .{data});
// const result = future.await(io);
// ✅ Do this instead (IO thread never blocks):
fn myHandler(allocator: Allocator, ctx: *Context) anyerror!void {
    ctx.deferred = true;
    const resp = try allocator.create(DeferredResponse);
    resp.* = .{ .server = server, .conn_id = ctx.conn_id, .allocator = allocator };
    Next.submit(Ctx, .{ .resp = resp, .data = data }, exec);
}

See the Next.submit section above for the full exec/complete callback API.
See example/ and src/example.zig.
| Component | Size | Notes |
|---|---|---|
| StackSlot (per connection) | 384 bytes | 5 cache-line-aligned sub-structures |
| StackPool (1M slots) | ~400 MB | 384B per StackSlot, contiguous, warmup-touched |
| Freelist (1M u32) | 4 MB | O(1) acquire/release |
| Read buffer (idle) | 0 bytes | io_uring provided buffers, returned on idle |
| Slab for io_uring reads | 64 MB | 16384 × 4KB blocks, kernel-recycled |
| Tiered write pool | dynamic | 8 size classes (512B–64KB), freelist-recycled |
| Shared fiber stack | 64 KB | All fibers share one pre-allocated stack |
| LargeBufferPool | 64 MB | 64 × 1MB blocks for oversized requests |
| 1M idle connections | ~540 MB | No per-thread stack overhead |
Like greatws, idle connections consume zero buffer memory.
The 384-byte StackSlot is split across independent cache lines:
- line1 (64B): fd, gen_id, state, write_offset — only this is touched during CQE dispatch
- line2 (64B): conn_id, last_active_ms, active_list_pos — only touched during TTL scanning
- line3 (64B): fiber_context, large_buf_ptr — async anchors (Worker Pool / oversized bodies)
- line4 (128B): response_buf, write_iovs, WS queue — write path, not in the hot path
- line5 (64B): sentinel + workspace union — protocol parser scratch, zero extra allocation
The IO loop's hottest path (CQE dispatch → slot lookup) only touches line1. TTL scanning only touches line2. No cache-line ping-pong between unrelated operations.
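One way to express such a layout in Zig is to align the first field of each sub-structure to 64 bytes, which rounds every sub-structure up to whole cache lines. The field names below follow the list above, but their types, the 96-byte response buffer and the 56-byte workspace are guesses made only so the sizes add up.

```zig
const std = @import("std");

// Aligning the first field to 64 makes each struct 64-byte aligned,
// and its size a multiple of 64. Field types are assumptions.
const Line1 = extern struct { fd: i32 align(64), gen_id: u32, state: u8, write_offset: u32, req_count: u32 };
const Line2 = extern struct { conn_id: u64 align(64), last_active_ms: i64, active_list_pos: u32 };
const Line3 = extern struct { fiber_context: ?*anyopaque align(64), large_buf_ptr: ?*anyopaque };
const Line4 = extern struct { response_buf: [96]u8 align(64) }; // spills into a second line: 128B
const Line5 = extern struct { sentinel: u32 align(64), workspace: [56]u8 };

/// extern struct keeps the declared field order, so offsets are predictable.
const Slot = extern struct { line1: Line1, line2: Line2, line3: Line3, line4: Line4, line5: Line5 };

comptime {
    std.debug.assert(@sizeOf(Line1) == 64);
    std.debug.assert(@sizeOf(Line4) == 128);
    std.debug.assert(@sizeOf(Slot) == 384); // 64 + 64 + 64 + 128 + 64
    std.debug.assert(@offsetOf(Slot, "line2") == 64);
}

/// Bytes needed for a pool of 1_048_576 slots.
const pool_bytes: usize = 1_048_576 * @sizeOf(Slot);
```

The comptime assertions turn a layout regression (for example a field that pushes line1 past 64 bytes) into a compile error rather than a silent cache-line spill.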
WS handlers may offload frame data asynchronously, so frame payloads must remain valid after handler returns. WS frame payloads are always duped — never zero-copy.
Performance impact (100B text frame):
| Operation | Cost | Notes |
|---|---|---|
| memcpy(100B) | ~10ns | Copy frame payload |
| GeneralPurposeAllocator alloc/free | ~100ns | One alloc+free per frame |
~110ns overhead per frame. 1M connections, 1% active, 10 msg/s each = 100K msg/s:
- CPU: 100K × 110ns = 11ms/s = 1.1% of one core
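The estimate above can be checked with integer arithmetic. The sketch simply restates the numbers from the text (1M connections, 1% active, 10 msg/s each, ~110ns per frame); the names are illustrative.

```zig
const std = @import("std");

const connections: u64 = 1_000_000;
const active_percent: u64 = 1;
const msgs_per_sec_each: u64 = 10;
const overhead_ns: u64 = 10 + 100; // memcpy + alloc/free, from the table

/// Frames per second across all active connections.
fn messagesPerSecond() u64 {
    return connections * active_percent / 100 * msgs_per_sec_each;
}

/// Nanoseconds of copy+alloc overhead spent per wall-clock second.
fn overheadNsPerSecond() u64 {
    return messagesPerSecond() * overhead_ns;
}

pub fn main() void {
    const ns = overheadNsPerSecond();
    // Dividing ns by 1_000_000_000 gives the fraction of one core.
    std.debug.print(
        "{d} msg/s, {d} ms of CPU per second\n",
        .{ messagesPerSecond(), ns / 1_000_000 },
    );
}
```

11 ms of work per 1000 ms of wall time is the 1.1% of one core quoted above.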
| key | default | description |
|---|---|---|
| fiber_stack_size_kb | 64 | fiber stack size (KB). 0 = 64 |
| io_cpu | null | pin IO thread to CPU core |
| idle_timeout_ms | 30000 | close idle connections |
| write_timeout_ms | 5000 | close stuck-write connections |
| buffer_size | 4096 | io_uring buffer block size |
| buffer_pool_size | 16384 | number of buffer blocks |
Cross-thread safe callback to IO thread. Underneath is rs.invoke (CAS lock-free linked list), drained automatically in drainTick.
server.invokeOnIoThread(MyCtx, ctx, struct {
    fn run(allocator: Allocator, c: *MyCtx) void {
        // Runs on IO thread, so it is safe to access ring/registry
        c.client.write("PUB ...");
        allocator.free(c.data);
    }
}.run);

Wire your DB driver's TCP fd into io_uring directly:
handler (fiber on IO thread):
└── db.query(sql)
└── io_uring write(fd, query) → CQE → io_uring read(fd) → CQE → parse
→ ctx.json(200, result)
For connection pooling: maintain a pool of connected TCP fds in a ringbuffer. Handler pops fd, issues write(sql) + read() via io_uring, parses result, pushes fd back.
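A ring buffer of idle fds for this kind of pool could be as small as the sketch below. `FdRing` and its methods are hypothetical names; the sketch assumes every push and pop happens on the IO thread, so it uses no atomics or locks.

```zig
const std = @import("std");

/// Fixed-capacity FIFO of connected socket fds (hypothetical sketch).
/// Single-threaded by design: all access is assumed to be on the IO thread.
fn FdRing(comptime capacity: usize) type {
    return struct {
        const Self = @This();

        fds: [capacity]i32 = undefined,
        head: usize = 0, // position of the next fd to pop
        len: usize = 0, // number of idle fds currently stored

        /// Return a connection to the pool; false if the pool is already full.
        pub fn push(self: *Self, fd: i32) bool {
            if (self.len == capacity) return false;
            self.fds[(self.head + self.len) % capacity] = fd;
            self.len += 1;
            return true;
        }

        /// Borrow the oldest idle connection, or null if none is available.
        pub fn pop(self: *Self) ?i32 {
            if (self.len == 0) return null;
            const fd = self.fds[self.head];
            self.head = (self.head + 1) % capacity;
            self.len -= 1;
            return fd;
        }
    };
}
```

A handler would pop an fd, run its write and read through io_uring, and push the fd back once the result is parsed; when pop returns null, the handler has to open a new connection or wait.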
MIT