Merged
51 commits
f505794
feat(inference): add Llama config parser with fixture test
dndungu Mar 3, 2026
9ee68f1
docs(plan): mark T57.2 complete (Llama config parser)
dndungu Mar 3, 2026
a8aa4ba
feat(inference): add Mistral and Qwen config parsers
dndungu Mar 3, 2026
e206534
docs(plan): mark T57.3 complete (Mistral and Qwen config parsers)
dndungu Mar 3, 2026
20d15de
feat(inference): add Phi and DeepSeek config parsers
dndungu Mar 3, 2026
06fb782
docs(plan): mark T57.4 complete (Phi and DeepSeek config parsers)
dndungu Mar 3, 2026
b82ca5d
feat(inference): integrate config registry into loadMetadata
dndungu Mar 3, 2026
640b7f7
docs(plan): mark T57.5 complete (config registry integration)
dndungu Mar 3, 2026
8b69258
docs(plan): mark T57.6 and E57 complete (all quality gates pass)
dndungu Mar 3, 2026
0926a0f
feat(model): add architecture-aware parameter name resolver
dndungu Mar 3, 2026
68f569e
docs(plan): mark T58.1 complete (parameter name resolver)
dndungu Mar 3, 2026
db209f4
feat(model): integrate ParamResolver into BuildFromZMF
dndungu Mar 3, 2026
c7afb92
feat: forward BuildOption through top-level BuildFromZMF wrapper
dndungu Mar 3, 2026
abeeeaa
docs(plan): mark T58.2 complete (resolver integration in builder)
dndungu Mar 3, 2026
9380f44
docs(plan): mark T58.3 and E58 complete (all quality gates pass)
dndungu Mar 3, 2026
4aa07c6
refactor(parity): extract shared test helpers from Gemma 3 tests
dndungu Mar 3, 2026
950e80f
feat(parity): add Llama 3 forward pass, greedy decode, and generation…
dndungu Mar 3, 2026
b2ec741
docs(plan): mark T59.1 complete (Llama 3 parity tests)
dndungu Mar 3, 2026
8f424c8
refactor(parity): consolidate model tests with modelParityConfig
dndungu Mar 3, 2026
d4688f1
feat(parity): add Mistral forward pass, greedy decode, and generation…
dndungu Mar 3, 2026
f51d8e1
docs(plan): mark T59.2 complete (Mistral parity tests)
dndungu Mar 3, 2026
3ffac84
docs(plan): mark T59.3 and E59 complete (all quality gates pass)
dndungu Mar 3, 2026
03a6143
feat(attention): add optional QKV bias support to GQA builder
dndungu Mar 3, 2026
86e5557
docs(plan): mark T60.1 complete (QKV bias support in GQA)
dndungu Mar 3, 2026
ebb3d79
docs(plan): mark T60.2 and E60 complete (QKV bias quality gates pass)
dndungu Mar 3, 2026
9841f39
feat(embeddings): add YaRN RoPE scaling for long-context models
dndungu Mar 3, 2026
6966a2b
docs(plan): mark T61.1 complete (YaRN RoPE scaling)
dndungu Mar 3, 2026
4c05d9d
feat(attention): support YaRN scaling in GQA builder
dndungu Mar 3, 2026
68b2530
feat(model): add WithGlobalAttributes and wire YaRN config
dndungu Mar 3, 2026
1784ba9
docs(plan): mark T61.2 complete (YaRN config integration)
dndungu Mar 3, 2026
6915ebc
docs(plan): mark T61.3 complete (E61 YaRN scaling verified)
dndungu Mar 3, 2026
dc087e7
test(parity): add Qwen 2.5 parity tests
dndungu Mar 3, 2026
0cb09d7
docs(plan): mark E62 complete (Qwen validation)
dndungu Mar 3, 2026
1d90bbc
feat(embeddings): add partial RoPE via WithRotaryDimFraction
dndungu Mar 3, 2026
d666fde
feat(attention): support partial RoPE in GQA builder
dndungu Mar 3, 2026
d96191d
feat(inference): wire partial_rotary_factor into global attributes
dndungu Mar 3, 2026
1bbac58
docs(plan): mark E63 complete (partial RoPE for Phi-4)
dndungu Mar 3, 2026
7fd6123
feat(core): add NewTiedLMHead for weight-tied embeddings
dndungu Mar 3, 2026
6d70340
docs(plan): mark E64 complete (tied embeddings)
dndungu Mar 3, 2026
eb84f15
test(parity): add Phi-4 parity tests
dndungu Mar 3, 2026
6a9443a
docs(plan): mark E65 complete (Phi-4 validation)
dndungu Mar 3, 2026
bcfd22f
feat(attention): add Multi-head Latent Attention (MLA) layer
dndungu Mar 3, 2026
1acf2a6
feat(attention): add BuildMultiHeadLatentAttention builder
dndungu Mar 3, 2026
5a81b9d
feat(registry): register MultiHeadLatentAttention in RegisterAll
dndungu Mar 3, 2026
f4b96f2
docs(plan): mark E66 complete (Multi-head Latent Attention)
dndungu Mar 3, 2026
06a12ea
feat(core): add shared expert support to MixtureOfExperts
dndungu Mar 3, 2026
beba4bd
docs(plan): mark E67 complete (shared expert MoE)
dndungu Mar 3, 2026
13b0974
test(parity): add DeepSeek V3 parity tests
dndungu Mar 3, 2026
88499c8
docs(plan): mark E68 complete (DeepSeek V3 validation)
dndungu Mar 3, 2026
8d45548
docs: mark Phase 9 complete and add multi-architecture section
dndungu Mar 3, 2026
74cf90e
docs: extract stable knowledge into ADRs and trim plan.md
dndungu Mar 3, 2026
92 changes: 92 additions & 0 deletions docs/adr/001-enterprise-production-readiness.md
@@ -0,0 +1,92 @@
# ADR-001: Enterprise Production Readiness

**Status:** Accepted
**Phase:** 4 + 7
**Date:** 2026-03-01

## Context

Zerfoo had strong foundations (clean interfaces, modular architecture, type-safe
generics, 95%+ test coverage) but lacked operational hardening for enterprise
production deployment. Gaps existed in observability, security, reliability,
configuration management, and CI/CD enforcement.

Additionally, a Phase 7 architecture review identified structural issues: dead
code (pkg/prelude, tests/helpers nil stubs), an inverted layer-registration
dependency (layers/core -> model), and a thread-unsafe graph memo map.

## Decision

### Observability (E21-E22)

- **Structured logging** via `log/` package: Logger interface with
  Debug/Info/Warn/Error levels, JSON output mode, NopLogger for zero-overhead
  disabling. All packages instrumented (compute, distributed, training, model,
  cmd).
- **Metrics** via `metrics/runtime/` package: Collector interface with Counter,
Gauge, Histogram. InMemoryCollector for testing; NopCollector for production
disabling. CPUEngine/GPUEngine and distributed ops instrumented.

### Security (E23)

- TLS/mTLS for all gRPC via `distributed.TLSConfig`. Plaintext fallback for
local dev (nil config). Input validation on all RPC handlers (completed in
Phase 5 via E32).

### Configuration (E24)

- `config/` package: `Load[T](path)` and `LoadWithEnv[T](path, prefix)` for
JSON config with env var overrides via struct tags. Validation via
`validate:"required"` tag. Standard structs: EngineConfig, TrainingConfig,
DistributedConfig.

### Reliability (E25-E26, E28)

- **Graceful shutdown** via `shutdown/` package: Coordinator with reverse-order
Closer execution, per-closer timeout, signal handling (SIGINT/SIGTERM) in CLI.
- **Health checks** via `health/` package: HTTP /healthz (liveness), /readyz
(readiness with configurable checks), /debug/pprof/. Engine health check
verifies compute is operational.
- **Resource limits**: MemoryTracker with CAS-based enforcement at Engine level.
Per-operation timeout via context.Context deadline checks.

### CI/CD Hardening (E27)

- Parity and numerics tests blocking in CI (removed `|| true`).
- Coverage gate via `cmd/coverage-gate/`: fails if any testable package drops
below 93%.
- Benchmark regression via `cmd/bench-compare/`: fails on >10% regression.
- Race detector on all unit tests. Go 1.25 on Ubuntu + macOS runners.

### Architecture Cleanup (Phase 7: E44-E46)

- **Dead code removal**: Deleted pkg/prelude (empty), tests/helpers/wire.go
(4 nil stubs), 7 dead test files (17 always-skipping tests).
- **Registration consolidation**: Removed init() from layers/core/registry.go.
Single entry point: `layers/registry.RegisterAll()`. Exported BuildFFN.
- **Graph thread safety**: Added sync.Mutex to graph.Graph protecting memo map
in Forward/Backward. Coarse-grained lock (correct for graphs < 1000 nodes).
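
The coarse-grained memo lock can be illustrated with a reduced sketch; the real graph.Graph memoizes tensor values per node, but the locking pattern is the same:

```go
package main

import (
	"fmt"
	"sync"
)

// Graph sketches the memoized Forward described above: one mutex guards
// the memo map, so concurrent Forward calls are safe and each node's
// compute function runs at most once.
type Graph struct {
	mu   sync.Mutex
	memo map[string]float64
}

func (g *Graph) Forward(node string, compute func() float64) float64 {
	g.mu.Lock()
	defer g.mu.Unlock()
	if v, ok := g.memo[node]; ok {
		return v // memo hit: compute is never called
	}
	v := compute()
	g.memo[node] = v
	return v
}

func main() {
	g := &Graph{memo: map[string]float64{}}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			g.Forward("relu0", func() float64 { return 42 })
		}()
	}
	wg.Wait()
	fmt.Println(g.memo["relu0"]) // 42, computed under the lock
}
```

Holding the lock across the compute call is what makes this coarse-grained: simple and correct, at the cost of serializing node evaluation.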

## Consequences

- All packages use leveled structured logging; no raw fmt.Printf in production.
- Runtime metrics available for Prometheus scraping or in-memory snapshot.
- gRPC is TLS-capable; plaintext is opt-in (nil TLSConfig).
- CI enforces coverage >= 93%, benchmark regression < 10%, zero races.
- Graph.Forward is safe for concurrent use from multiple goroutines.
- No init()-based registration; single wiring point reduces coupling.

### Blocked Item

- **E29 GPU hardware validation**: Blocked on GCP GPU quota = 0. Quota request
pending (preference ID: zerfoo-gpu-test, project: numerai-488804). Unblock by
checking quota status or trying a different cloud provider.

### Key Files

- `log/logger.go` -- Logger interface, StdLogger, NopLogger
- `metrics/runtime/metrics.go` -- Collector, InMemoryCollector, NopCollector
- `config/loader.go` -- Load[T], LoadWithEnv[T]
- `shutdown/coordinator.go` -- Closer, Coordinator
- `health/server.go` -- Health HTTP server
- `cmd/coverage-gate/main.go` -- CI coverage enforcement
- `cmd/bench-compare/main.go` -- CI benchmark regression detection
74 changes: 74 additions & 0 deletions docs/adr/002-distributed-training-protocol.md
@@ -0,0 +1,74 @@
# ADR-002: Distributed Training Protocol

**Status:** Accepted
**Phase:** 5
**Date:** 2026-03-01

## Context

The distributed package had auto-generated protobuf stubs, a coordinator server,
InternalStrategy[T] interface, AllReduceStrategy[T], NetworkManager, and
ServerManager. Missing: concrete DistributedServiceServer, GrpcStrategy[T]
connecting strategy to transport, WorkerNode lifecycle management, and
multi-worker integration tests.

## Decision

### AllReduce Protocol (Star Topology)

Root (rank 0) runs the server collecting gradients from all peers. Each non-root
worker opens a bidi stream to root, sends gradients as AllReduceRequest messages
(one per named tensor), then waits for AllReduceResponse with averaged results.
Root accumulates peer gradients plus its own, computes element-wise average
(sum / world_size), and streams results back.

A `reduceSession` struct coordinates across concurrent AllReduce stream handlers:
collects tensors by name from each peer, waits via sync barrier, computes
reduction, distributes result.

Ring all-reduce optimization was explicitly deferred (correctness first).
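
The root's reduction step (sum / world_size) reduces to a short function. This is an illustrative sketch over plain slices, not the reduceSession's actual tensor types:

```go
package main

import "fmt"

// allReduceAvg sketches the root's reduction: accumulate the gradient
// for one named tensor from every rank (root included), then divide by
// world size to get the element-wise average.
func allReduceAvg(grads [][]float32) []float32 {
	out := make([]float32, len(grads[0]))
	for _, g := range grads {
		for i, v := range g {
			out[i] += v
		}
	}
	world := float32(len(grads))
	for i := range out {
		out[i] /= world
	}
	return out
}

func main() {
	// gradients for one tensor from rank 0 (root) and two peers
	got := allReduceAvg([][]float32{{1, 2}, {3, 4}, {5, 6}})
	fmt.Println(got) // [3 4]
}
```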

### Barrier Protocol

Counter-based. Each worker calls Barrier RPC on root. Root counts arrivals and
blocks via sync.Cond until all workers arrive. Epoch numbers prevent stale
barrier responses.

### Broadcast Protocol

Root stores broadcast tensor in thread-safe map keyed by name. Non-root workers
call Broadcast RPC on root. If tensor not yet available, handler waits with
context deadline.

### Tensor Serialization

pb.Tensor uses repeated float (float32 only). GrpcStrategy[T] converts
tensor.TensorNumeric[T] to/from pb.Tensor. For T=float64, values narrow to
float32 for transport (acceptable for gradient averaging).

### Worker Lifecycle

WorkerNode struct encapsulates: GrpcStrategy, coordinator connection, health
check registration, shutdown.Closer implementation. Start(ctx, cfg) initializes
strategy, registers with coordinator, starts gRPC server, connects to peers.
Close(ctx) triggers orderly shutdown.

CLI `worker` subcommand: --coordinator-address, --worker-address, --worker-id
flags. Signal handling via cli.SignalContext.

## Consequences

- Star topology is simple and correct but O(N) at root. Ring optimization
deferred to future phase.
- All distributed operations tested end-to-end over real gRPC (bufconn).
- TLS integration tested with self-signed certificates.
- Worker lifecycle integrated with health checks and shutdown coordinator.
- distributed/ package at 96% coverage.

### Key Files

- `distributed/worker_service.go` -- DistributedServiceServer implementation
- `distributed/grpc_strategy.go` -- GrpcStrategy[T] (InternalStrategy over gRPC)
- `distributed/worker_node.go` -- WorkerNode lifecycle
- `distributed/integration_test.go` -- Multi-worker tests (bufconn)
- `cmd/cli/worker.go` -- Worker CLI subcommand
84 changes: 84 additions & 0 deletions docs/adr/003-open-weights-model-import.md
@@ -0,0 +1,84 @@
# ADR-003: Open Weights Model Import

**Status:** Accepted
**Phase:** 6
**Date:** 2026-03-02

## Context

Zerfoo could train and run inference on models built with its layer API, but
importing pre-trained open-weights models (Gemma 3, Kimi-VL) required closing
gaps in the ONNX import pipeline (zonnx repo) and layer registry.

Gap analysis identified blockers: zonnx converter missing TENSOR attribute and
UINT8 dtype; MatMulNBits and Constant not registered; vision encoder operators
(Conv2d, Pad, Slice, Resize, BatchNorm, GlobalAveragePool) not implemented;
MoE not implemented.

## Decision

### 4-Bit Weight Packing

MatMulNBits stores 4-bit weights packed two-per-byte in UINT8 tensors. ZMF uses
DataType=UINT8. Dequantization happens in MatMulNBits.Forward() using
numeric.Unpack4BitSlice. Supports symmetric and asymmetric quantization with
per-block scales and optional zero-points.
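
The packing and dequantization can be sketched as below. Function names and the low-nibble-first layout are assumptions for illustration; the real helper is numeric.Unpack4BitSlice, and the actual nibble order and block layout may differ:

```go
package main

import "fmt"

// unpack4Bit sketches the two-nibbles-per-byte layout, assuming the low
// nibble holds the first value.
func unpack4Bit(packed []byte) []uint8 {
	out := make([]uint8, 0, len(packed)*2)
	for _, b := range packed {
		out = append(out, b&0x0F, b>>4)
	}
	return out
}

// dequant applies a per-block scale and zero-point, as in asymmetric
// quantization: w = scale * (q - zeroPoint). Symmetric quantization is
// the zeroPoint == 0 special case.
func dequant(q []uint8, scale float32, zeroPoint uint8) []float32 {
	out := make([]float32, len(q))
	for i, v := range q {
		out[i] = scale * float32(int(v)-int(zeroPoint))
	}
	return out
}

func main() {
	packed := []byte{0x21, 0x43} // nibbles 1,2 then 3,4
	q := unpack4Bit(packed)
	fmt.Println(q)                  // [1 2 3 4]
	fmt.Println(dequant(q, 0.5, 2)) // [-0.5 0 0.5 1]
}
```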

### Conv2d Strategy

Direct nested-loop convolution (not im2col + MatMul). Simpler, correct for
inference workloads, avoids allocating large intermediate matrices. Deviation
from original plan noted.
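
The direct nested-loop form, reduced to a single channel with stride 1 and no padding, is just four loops over the output and kernel positions; this sketch omits the channel and batch dimensions the real Conv2d handles:

```go
package main

import "fmt"

// conv2d sketches the direct nested-loop convolution chosen over
// im2col + MatMul: single channel, stride 1, no padding, valid output.
func conv2d(in [][]float32, k [][]float32) [][]float32 {
	kh, kw := len(k), len(k[0])
	oh, ow := len(in)-kh+1, len(in[0])-kw+1
	out := make([][]float32, oh)
	for y := 0; y < oh; y++ {
		out[y] = make([]float32, ow)
		for x := 0; x < ow; x++ {
			var sum float32
			for ky := 0; ky < kh; ky++ {
				for kx := 0; kx < kw; kx++ {
					sum += in[y+ky][x+kx] * k[ky][kx]
				}
			}
			out[y][x] = sum
		}
	}
	return out
}

func main() {
	in := [][]float32{{1, 2, 3}, {4, 5, 6}, {7, 8, 9}}
	box := [][]float32{{1, 1}, {1, 1}} // 2x2 summing kernel
	fmt.Println(conv2d(in, box))       // [[12 16] [24 28]]
}
```

No intermediate im2col matrix is ever allocated, which is the memory win the decision above describes.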

### MoE Design

MoEGate routes tokens to top-k experts via softmax + topK selection. Gate weight
passed as runtime Forward input (not from params) to match the ONNX/ZMF pattern.
MixtureOfExperts dispatches to selected experts and aggregates weighted outputs.
Expert loading from ZMF sub-graphs deferred (tech debt).
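
The softmax + topK routing step can be sketched as follows; `topKGate` is an illustrative stand-in for MoEGate's Forward, working on one token's expert logits:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// topKGate sketches MoEGate routing: softmax over expert logits, then
// keep the k highest-probability experts with their gating weights.
func topKGate(logits []float64, k int) (idx []int, w []float64) {
	// numerically stable softmax
	max := logits[0]
	for _, v := range logits {
		if v > max {
			max = v
		}
	}
	probs := make([]float64, len(logits))
	var sum float64
	for i, v := range logits {
		probs[i] = math.Exp(v - max)
		sum += probs[i]
	}
	for i := range probs {
		probs[i] /= sum
	}
	// top-k selection by descending probability
	order := make([]int, len(probs))
	for i := range order {
		order[i] = i
	}
	sort.Slice(order, func(a, b int) bool { return probs[order[a]] > probs[order[b]] })
	for _, i := range order[:k] {
		idx = append(idx, i)
		w = append(w, probs[i])
	}
	return idx, w
}

func main() {
	idx, w := topKGate([]float64{0.1, 2.0, 1.0, -1.0}, 2)
	fmt.Println(idx) // experts 1 and 2 selected
	fmt.Println(w[0] > w[1])
}
```

MixtureOfExperts then runs only the selected experts and sums their outputs scaled by these weights.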

### Operator Inventory

New operators implemented and registered:

| Operator | File | Category |
|----------|------|----------|
| Softmax | layers/activations/softmax.go | Activation |
| Sigmoid builder | layers/activations/registry.go | Activation |
| Erf | layers/activations/erf.go | Activation |
| LayerNormalization | layers/normalization/registry.go | Normalization |
| BatchNormalization | layers/normalization/batch_norm.go | Normalization |
| Slice | layers/core/slice.go | Core |
| Pad | layers/core/pad.go | Core |
| TopK | layers/core/topk.go | Core |
| Conv2d | layers/core/conv2d.go | Core |
| GlobalAveragePool | layers/core/global_avg_pool.go | Core |
| Resize | layers/core/resize.go | Core |
| MoEGate | layers/core/moe.go | Core |
| MixtureOfExperts | layers/core/moe.go | Core |
| Constant | layers/core/constant.go | Core |

### Multi-Repo Discipline

zonnx and zerfoo are separate repos. Pre-commit hooks reject multi-directory
commits. zonnx converter fixes committed in zonnx; layer/model changes in zerfoo.
ONNX-to-ZMF conversion in zonnx handles special cases (Slice/Pad/TopK input
promotion, Resize scales/sizes, MatMulNBits dequantization).

## Consequences

- Gemma 3 end-to-end import validated (forward pass + greedy decode).
- SigLIP vision encoder and Kimi-VL connector validated.
- 13 new operators added to registry; total 56+ layers.
- MatMulNBits dequantization at converter level (not runtime) for standard
MatMul path.
- Expert loading from sub-graphs is documented tech debt.
- All parity tests are env-var gated (skip gracefully without model files).

### Key Files

- `layers/core/moe.go` -- MoEGate, MixtureOfExperts
- `layers/core/conv2d.go` -- Conv2d (nested-loop)
- `layers/core/constant.go` -- Constant node
- `tests/parity/gemma3_test.go` -- Gemma 3 parity test
- `tests/parity/siglip_test.go` -- SigLIP/Kimi-VL parity tests
99 changes: 99 additions & 0 deletions docs/adr/004-embeddable-inference-library.md
@@ -0,0 +1,99 @@
# ADR-004: Embeddable Inference Library

**Status:** Accepted
**Phase:** 8
**Date:** 2026-03-02

## Context

Running inference on an imported model required extensive manual wiring:
downloading ONNX files, converting them with the zonnx CLI, writing Go code to
create the Engine/Graph, using a whitespace-only tokenizer, and calling Forward
in a manual loop (no KV cache, O(n^2)), with no sampling beyond argmax and no
streaming.

Phase 8 transforms zerfoo into an embeddable inference library:

```go
m, _ := inference.Load("google/gemma-3-4b-it")
resp, _ := m.Generate(ctx, "Explain quantum computing")
```

## Decision

### BPE Tokenizer

Pure Go, no CGo. Loads HuggingFace tokenizer.json format (vocab, merge rules,
pre-tokenizer config, normalizer, special tokens). BPE merge loop: split into
bytes, iteratively merge highest-priority adjacent pair. Byte-level BPE
pre-tokenization (GPT-2 style). SentencePiece .model files not supported
directly; most HuggingFace models ship tokenizer.json.
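
The merge loop can be sketched as follows. This is a simplified illustration over a toy merge table; the real tokenizer works on the byte-level alphabet and loads its ranks from tokenizer.json:

```go
package main

import "fmt"

// bpeMerge sketches the BPE merge loop: repeatedly find the adjacent
// pair with the highest-priority (lowest-rank) merge rule and join it,
// until no mergeable pair remains.
func bpeMerge(tokens []string, ranks map[[2]string]int) []string {
	for {
		best, bestRank := -1, 1<<30
		for i := 0; i+1 < len(tokens); i++ {
			if r, ok := ranks[[2]string{tokens[i], tokens[i+1]}]; ok && r < bestRank {
				best, bestRank = i, r
			}
		}
		if best < 0 {
			return tokens // no rule applies
		}
		merged := tokens[best] + tokens[best+1]
		tokens = append(tokens[:best], append([]string{merged}, tokens[best+2:]...)...)
	}
}

func main() {
	// toy merge table learned at training time; rank 0 applies first
	ranks := map[[2]string]int{
		{"l", "o"}:  0,
		{"lo", "w"}: 1,
	}
	fmt.Println(bpeMerge([]string{"l", "o", "w"}, ranks)) // [low]
}
```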

### KV Cache

GenerationContext embeds context.Context and carries *KVCache. KVCache stores
per-layer key/value tensors (appended on each step). Attention layers
(GroupQueryAttention, GlobalAttention) check for KVCache in context: if present,
append current K/V, use full cached K/V for computation. Graph.Forward()
signature unchanged (opt-in via context). Callers without KVCache get existing
full-recompute behavior.
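
The append-and-return-all pattern can be sketched with a reduced cache; the real KVCache stores tensors rather than float slices, but the per-layer append semantics are the same:

```go
package main

import "fmt"

// KVCache sketches the per-layer cache: each decode step appends one
// position's K/V row, so attention reuses all prior positions instead
// of recomputing them.
type KVCache struct {
	keys   map[int][][]float32 // layer index -> cached key rows
	values map[int][][]float32
}

func NewKVCache() *KVCache {
	return &KVCache{keys: map[int][][]float32{}, values: map[int][][]float32{}}
}

// Append adds the current step's K/V row for one layer and returns the
// full cached sequence for the attention computation.
func (c *KVCache) Append(layer int, k, v []float32) (ks, vs [][]float32) {
	c.keys[layer] = append(c.keys[layer], k)
	c.values[layer] = append(c.values[layer], v)
	return c.keys[layer], c.values[layer]
}

func main() {
	cache := NewKVCache()
	cache.Append(0, []float32{1}, []float32{10})
	ks, _ := cache.Append(0, []float32{2}, []float32{20})
	fmt.Println(len(ks)) // 2 cached positions after two steps
}
```

Because the cache rides on the context, a caller that never installs one gets the unchanged full-recompute path.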

### Generation Loop

Generator holds graph, tokenizer, engine, model config. Autoregressive loop:
1. Encode prompt to token IDs
2. Forward pass for logits [1, seqLen, vocabSize]
3. Extract last-position logits
4. Apply: temperature scaling -> top-k filtering -> top-p filtering ->
repetition penalty -> sample (or argmax at temperature=0)
5. Check stop conditions (EOS, max tokens, stop strings)
6. Repeat with new token as input (KV cache handles prefix)

### Streaming

TokenStream interface with OnToken(token string, done bool) error. GenerateStream
delivers each decoded token as generated. Sentinel-based stop-string detection
with delta emission.

### Model Registry

Local cache under ~/.zerfoo/models/. Layout: <org>/<model_name>/ containing
model.zmf, tokenizer.json, config.json. Pull: download from HuggingFace Hub API,
convert ONNX to ZMF (zonnx as Go library), cache locally. HF_TOKEN env var for
gated models.

### HTTP Serve

net/http server. OpenAI-compatible endpoints:
- POST /v1/chat/completions (non-streaming + SSE)
- POST /v1/completions (non-streaming + SSE)
- GET /v1/models

### Constraints

- Pure Go, no CGo, no external C libraries for tokenization.
- KV cache is opt-in; does not break existing callers.
- Model registry works offline after initial pull.
- No training through high-level API.
- No multi-model serving.

## Consequences

- 3-line model loading and generation for end users.
- O(n) per generation step with KV cache (vs O(n^2) without).
- CLI commands (pull/run/serve) for interactive and server use.
- OpenAI API compatibility enables tool interoperability.
- Coverage: generate 95%, inference 96.4%, serve 96.4%.
- Embeddings not yet supported (Embed returns error).

### Key Files

- `pkg/tokenizer/bpe.go` -- BPE tokenizer
- `pkg/tokenizer/loader.go` -- tokenizer.json loader
- `generate/kvcache.go` -- KV cache
- `generate/generator.go` -- Autoregressive generation loop
- `generate/sampling.go` -- Temperature, top-k, top-p, repetition penalty
- `generate/stream.go` -- TokenStream interface
- `registry/registry.go` -- ModelRegistry, LocalRegistry
- `inference/inference.go` -- Load, Generate, Chat, GenerateStream
- `serve/server.go` -- OpenAI-compatible HTTP server
- `cmd/cli/{pull,run,serve}.go` -- CLI commands