Merged
51 commits
f505794
feat(inference): add Llama config parser with fixture test
dndungu Mar 3, 2026
9ee68f1
docs(plan): mark T57.2 complete (Llama config parser)
dndungu Mar 3, 2026
a8aa4ba
feat(inference): add Mistral and Qwen config parsers
dndungu Mar 3, 2026
e206534
docs(plan): mark T57.3 complete (Mistral and Qwen config parsers)
dndungu Mar 3, 2026
20d15de
feat(inference): add Phi and DeepSeek config parsers
dndungu Mar 3, 2026
06fb782
docs(plan): mark T57.4 complete (Phi and DeepSeek config parsers)
dndungu Mar 3, 2026
b82ca5d
feat(inference): integrate config registry into loadMetadata
dndungu Mar 3, 2026
640b7f7
docs(plan): mark T57.5 complete (config registry integration)
dndungu Mar 3, 2026
8b69258
docs(plan): mark T57.6 and E57 complete (all quality gates pass)
dndungu Mar 3, 2026
0926a0f
feat(model): add architecture-aware parameter name resolver
dndungu Mar 3, 2026
68f569e
docs(plan): mark T58.1 complete (parameter name resolver)
dndungu Mar 3, 2026
db209f4
feat(model): integrate ParamResolver into BuildFromZMF
dndungu Mar 3, 2026
c7afb92
feat: forward BuildOption through top-level BuildFromZMF wrapper
dndungu Mar 3, 2026
abeeeaa
docs(plan): mark T58.2 complete (resolver integration in builder)
dndungu Mar 3, 2026
9380f44
docs(plan): mark T58.3 and E58 complete (all quality gates pass)
dndungu Mar 3, 2026
4aa07c6
refactor(parity): extract shared test helpers from Gemma 3 tests
dndungu Mar 3, 2026
950e80f
feat(parity): add Llama 3 forward pass, greedy decode, and generation…
dndungu Mar 3, 2026
b2ec741
docs(plan): mark T59.1 complete (Llama 3 parity tests)
dndungu Mar 3, 2026
8f424c8
refactor(parity): consolidate model tests with modelParityConfig
dndungu Mar 3, 2026
d4688f1
feat(parity): add Mistral forward pass, greedy decode, and generation…
dndungu Mar 3, 2026
f51d8e1
docs(plan): mark T59.2 complete (Mistral parity tests)
dndungu Mar 3, 2026
3ffac84
docs(plan): mark T59.3 and E59 complete (all quality gates pass)
dndungu Mar 3, 2026
03a6143
feat(attention): add optional QKV bias support to GQA builder
dndungu Mar 3, 2026
86e5557
docs(plan): mark T60.1 complete (QKV bias support in GQA)
dndungu Mar 3, 2026
ebb3d79
docs(plan): mark T60.2 and E60 complete (QKV bias quality gates pass)
dndungu Mar 3, 2026
9841f39
feat(embeddings): add YaRN RoPE scaling for long-context models
dndungu Mar 3, 2026
6966a2b
docs(plan): mark T61.1 complete (YaRN RoPE scaling)
dndungu Mar 3, 2026
4c05d9d
feat(attention): support YaRN scaling in GQA builder
dndungu Mar 3, 2026
68b2530
feat(model): add WithGlobalAttributes and wire YaRN config
dndungu Mar 3, 2026
1784ba9
docs(plan): mark T61.2 complete (YaRN config integration)
dndungu Mar 3, 2026
6915ebc
docs(plan): mark T61.3 complete (E61 YaRN scaling verified)
dndungu Mar 3, 2026
dc087e7
test(parity): add Qwen 2.5 parity tests
dndungu Mar 3, 2026
0cb09d7
docs(plan): mark E62 complete (Qwen validation)
dndungu Mar 3, 2026
1d90bbc
feat(embeddings): add partial RoPE via WithRotaryDimFraction
dndungu Mar 3, 2026
d666fde
feat(attention): support partial RoPE in GQA builder
dndungu Mar 3, 2026
d96191d
feat(inference): wire partial_rotary_factor into global attributes
dndungu Mar 3, 2026
1bbac58
docs(plan): mark E63 complete (partial RoPE for Phi-4)
dndungu Mar 3, 2026
7fd6123
feat(core): add NewTiedLMHead for weight-tied embeddings
dndungu Mar 3, 2026
6d70340
docs(plan): mark E64 complete (tied embeddings)
dndungu Mar 3, 2026
eb84f15
test(parity): add Phi-4 parity tests
dndungu Mar 3, 2026
6a9443a
docs(plan): mark E65 complete (Phi-4 validation)
dndungu Mar 3, 2026
bcfd22f
feat(attention): add Multi-head Latent Attention (MLA) layer
dndungu Mar 3, 2026
1acf2a6
feat(attention): add BuildMultiHeadLatentAttention builder
dndungu Mar 3, 2026
5a81b9d
feat(registry): register MultiHeadLatentAttention in RegisterAll
dndungu Mar 3, 2026
f4b96f2
docs(plan): mark E66 complete (Multi-head Latent Attention)
dndungu Mar 3, 2026
06a12ea
feat(core): add shared expert support to MixtureOfExperts
dndungu Mar 3, 2026
beba4bd
docs(plan): mark E67 complete (shared expert MoE)
dndungu Mar 3, 2026
13b0974
test(parity): add DeepSeek V3 parity tests
dndungu Mar 3, 2026
88499c8
docs(plan): mark E68 complete (DeepSeek V3 validation)
dndungu Mar 3, 2026
8d45548
docs: mark Phase 9 complete and add multi-architecture section
dndungu Mar 3, 2026
74cf90e
docs: extract stable knowledge into ADRs and trim plan.md
dndungu Mar 3, 2026
92 changes: 92 additions & 0 deletions docs/adr/001-enterprise-production-readiness.md
@@ -0,0 +1,92 @@
# ADR-001: Enterprise Production Readiness

**Status:** Accepted
**Phase:** 4 + 7
**Date:** 2026-03-01

## Context

Zerfoo had strong foundations (clean interfaces, modular architecture, type-safe
generics, 95%+ test coverage) but lacked operational hardening for enterprise
production deployment. Gaps existed in observability, security, reliability,
configuration management, and CI/CD enforcement.

Additionally, a Phase 7 architecture review identified structural issues: dead
code (pkg/prelude, tests/helpers nil stubs), an inverted layer-registration
dependency (layers/core -> model), and a thread-unsafe graph memo map.

## Decision

### Observability (E21-E22)

- **Structured logging** via `log/` package: Logger interface with
  Debug/Info/Warn/Error levels, JSON output mode, NopLogger for zero-overhead
  disabling. All packages instrumented (compute, distributed, training, model,
  cmd).
- **Metrics** via `metrics/runtime/` package: Collector interface with Counter,
Gauge, Histogram. InMemoryCollector for testing; NopCollector for production
disabling. CPUEngine/GPUEngine and distributed ops instrumented.

### Security (E23)

- TLS/mTLS for all gRPC via `distributed.TLSConfig`. Plaintext fallback for
local dev (nil config). Input validation on all RPC handlers (completed in
Phase 5 via E32).

### Configuration (E24)

- `config/` package: `Load[T](path)` and `LoadWithEnv[T](path, prefix)` for
JSON config with env var overrides via struct tags. Validation via
`validate:"required"` tag. Standard structs: EngineConfig, TrainingConfig,
DistributedConfig.

### Reliability (E25-E26, E28)

- **Graceful shutdown** via `shutdown/` package: Coordinator with reverse-order
Closer execution, per-closer timeout, signal handling (SIGINT/SIGTERM) in CLI.
- **Health checks** via `health/` package: HTTP /healthz (liveness), /readyz
(readiness with configurable checks), /debug/pprof/. Engine health check
verifies compute is operational.
- **Resource limits**: MemoryTracker with CAS-based enforcement at Engine level.
Per-operation timeout via context.Context deadline checks.

### CI/CD Hardening (E27)

- Parity and numerics tests blocking in CI (removed `|| true`).
- Coverage gate via `cmd/coverage-gate/`: fails if any testable package drops
below 93%.
- Benchmark regression via `cmd/bench-compare/`: fails on >10% regression.
- Race detector on all unit tests. Go 1.25 on Ubuntu + macOS runners.

### Architecture Cleanup (Phase 7: E44-E46)

- **Dead code removal**: Deleted pkg/prelude (empty), tests/helpers/wire.go
(4 nil stubs), 7 dead test files (17 always-skipping tests).
- **Registration consolidation**: Removed init() from layers/core/registry.go.
Single entry point: `layers/registry.RegisterAll()`. Exported BuildFFN.
- **Graph thread safety**: Added sync.Mutex to graph.Graph protecting memo map
in Forward/Backward. Coarse-grained lock (correct for graphs < 1000 nodes).
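
The coarse-grained memo lock can be illustrated with a reduced sketch; the real graph.Graph memoizes tensor values per node, but the locking pattern is the same:

```go
package main

import (
	"fmt"
	"sync"
)

// Graph sketches the memoized Forward described above: one mutex guards
// the memo map, so concurrent Forward calls are safe and each node's
// compute function runs at most once.
type Graph struct {
	mu   sync.Mutex
	memo map[string]float64
}

func (g *Graph) Forward(node string, compute func() float64) float64 {
	g.mu.Lock()
	defer g.mu.Unlock()
	if v, ok := g.memo[node]; ok {
		return v // memo hit: compute is never called
	}
	v := compute()
	g.memo[node] = v
	return v
}

func main() {
	g := &Graph{memo: map[string]float64{}}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			g.Forward("relu0", func() float64 { return 42 })
		}()
	}
	wg.Wait()
	fmt.Println(g.memo["relu0"]) // 42, computed under the lock
}
```

Holding the lock across the compute call is what makes this coarse-grained: simple and correct, at the cost of serializing node evaluation.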

## Consequences

- All packages use leveled structured logging; no raw fmt.Printf in production.
- Runtime metrics available for Prometheus scraping or in-memory snapshot.
- gRPC is TLS-capable; plaintext is opt-in (nil TLSConfig).
- CI enforces coverage >= 93%, benchmark regression < 10%, zero races.
- Graph.Forward is safe for concurrent use from multiple goroutines.
- No init()-based registration; single wiring point reduces coupling.

### Blocked Item

- **E29 GPU hardware validation**: Blocked on GCP GPU quota = 0. Quota request
pending (preference ID: zerfoo-gpu-test, project: numerai-488804). Unblock by
checking quota status or trying a different cloud provider.

### Key Files

- `log/logger.go` -- Logger interface, StdLogger, NopLogger
- `metrics/runtime/metrics.go` -- Collector, InMemoryCollector, NopCollector
- `config/loader.go` -- Load[T], LoadWithEnv[T]
- `shutdown/coordinator.go` -- Closer, Coordinator
- `health/server.go` -- Health HTTP server
- `cmd/coverage-gate/main.go` -- CI coverage enforcement
- `cmd/bench-compare/main.go` -- CI benchmark regression detection
74 changes: 74 additions & 0 deletions docs/adr/002-distributed-training-protocol.md
@@ -0,0 +1,74 @@
# ADR-002: Distributed Training Protocol

**Status:** Accepted
**Phase:** 5
**Date:** 2026-03-01

## Context

The distributed package had auto-generated protobuf stubs, a coordinator server,
InternalStrategy[T] interface, AllReduceStrategy[T], NetworkManager, and
ServerManager. Missing: concrete DistributedServiceServer, GrpcStrategy[T]
connecting strategy to transport, WorkerNode lifecycle management, and
multi-worker integration tests.

## Decision

### AllReduce Protocol (Star Topology)

Root (rank 0) runs the server collecting gradients from all peers. Each non-root
worker opens a bidi stream to root, sends gradients as AllReduceRequest messages
(one per named tensor), then waits for AllReduceResponse with averaged results.
Root accumulates peer gradients plus its own, computes element-wise average
(sum / world_size), and streams results back.

A `reduceSession` struct coordinates across concurrent AllReduce stream handlers:
collects tensors by name from each peer, waits via sync barrier, computes
reduction, distributes result.

Ring all-reduce optimization was explicitly deferred (correctness first).
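
The root's reduction step (sum / world_size) reduces to a short function. This is an illustrative sketch over plain slices, not the reduceSession's actual tensor types:

```go
package main

import "fmt"

// allReduceAvg sketches the root's reduction: accumulate the gradient
// for one named tensor from every rank (root included), then divide by
// world size to get the element-wise average.
func allReduceAvg(grads [][]float32) []float32 {
	out := make([]float32, len(grads[0]))
	for _, g := range grads {
		for i, v := range g {
			out[i] += v
		}
	}
	world := float32(len(grads))
	for i := range out {
		out[i] /= world
	}
	return out
}

func main() {
	// gradients for one tensor from rank 0 (root) and two peers
	got := allReduceAvg([][]float32{{1, 2}, {3, 4}, {5, 6}})
	fmt.Println(got) // [3 4]
}
```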

### Barrier Protocol

Counter-based. Each worker calls Barrier RPC on root. Root counts arrivals and
blocks via sync.Cond until all workers arrive. Epoch numbers prevent stale
barrier responses.

### Broadcast Protocol

Root stores broadcast tensor in thread-safe map keyed by name. Non-root workers
call Broadcast RPC on root. If tensor not yet available, handler waits with
context deadline.

### Tensor Serialization

pb.Tensor uses repeated float (float32 only). GrpcStrategy[T] converts
tensor.TensorNumeric[T] to/from pb.Tensor. For T=float64, values narrow to
float32 for transport (acceptable for gradient averaging).

### Worker Lifecycle

WorkerNode struct encapsulates: GrpcStrategy, coordinator connection, health
check registration, shutdown.Closer implementation. Start(ctx, cfg) initializes
strategy, registers with coordinator, starts gRPC server, connects to peers.
Close(ctx) triggers orderly shutdown.

CLI `worker` subcommand: --coordinator-address, --worker-address, --worker-id
flags. Signal handling via cli.SignalContext.

## Consequences

- Star topology is simple and correct but O(N) at root. Ring optimization
deferred to future phase.
- All distributed operations tested end-to-end over real gRPC (bufconn).
- TLS integration tested with self-signed certificates.
- Worker lifecycle integrated with health checks and shutdown coordinator.
- distributed/ package at 96% coverage.

### Key Files

- `distributed/worker_service.go` -- DistributedServiceServer implementation
- `distributed/grpc_strategy.go` -- GrpcStrategy[T] (InternalStrategy over gRPC)
- `distributed/worker_node.go` -- WorkerNode lifecycle
- `distributed/integration_test.go` -- Multi-worker tests (bufconn)
- `cmd/cli/worker.go` -- Worker CLI subcommand
84 changes: 84 additions & 0 deletions docs/adr/003-open-weights-model-import.md
@@ -0,0 +1,84 @@
# ADR-003: Open Weights Model Import

**Status:** Accepted
**Phase:** 6
**Date:** 2026-03-02

## Context

Zerfoo could train and run inference on models built with its layer API, but
importing pre-trained open-weights models (Gemma 3, Kimi-VL) required closing
gaps in the ONNX import pipeline (zonnx repo) and layer registry.

Gap analysis identified blockers: zonnx converter missing TENSOR attribute and
UINT8 dtype; MatMulNBits and Constant not registered; vision encoder operators
(Conv2d, Pad, Slice, Resize, BatchNorm, GlobalAveragePool) not implemented;
MoE not implemented.

## Decision

### 4-Bit Weight Packing

MatMulNBits stores 4-bit weights packed two-per-byte in UINT8 tensors. ZMF uses
DataType=UINT8. Dequantization happens in MatMulNBits.Forward() using
numeric.Unpack4BitSlice. Supports symmetric and asymmetric quantization with
per-block scales and optional zero-points.
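
The packing and dequantization can be sketched as below. Function names and the low-nibble-first layout are assumptions for illustration; the real helper is numeric.Unpack4BitSlice, and the actual nibble order and block layout may differ:

```go
package main

import "fmt"

// unpack4Bit sketches the two-nibbles-per-byte layout, assuming the low
// nibble holds the first value.
func unpack4Bit(packed []byte) []uint8 {
	out := make([]uint8, 0, len(packed)*2)
	for _, b := range packed {
		out = append(out, b&0x0F, b>>4)
	}
	return out
}

// dequant applies a per-block scale and zero-point, as in asymmetric
// quantization: w = scale * (q - zeroPoint). Symmetric quantization is
// the zeroPoint == 0 special case.
func dequant(q []uint8, scale float32, zeroPoint uint8) []float32 {
	out := make([]float32, len(q))
	for i, v := range q {
		out[i] = scale * float32(int(v)-int(zeroPoint))
	}
	return out
}

func main() {
	packed := []byte{0x21, 0x43} // nibbles 1,2 then 3,4
	q := unpack4Bit(packed)
	fmt.Println(q)                  // [1 2 3 4]
	fmt.Println(dequant(q, 0.5, 2)) // [-0.5 0 0.5 1]
}
```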

### Conv2d Strategy

Direct nested-loop convolution (not im2col + MatMul). Simpler, correct for
inference workloads, avoids allocating large intermediate matrices. Deviation
from original plan noted.
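
The direct nested-loop form, reduced to a single channel with stride 1 and no padding, is just four loops over the output and kernel positions; this sketch omits the channel and batch dimensions the real Conv2d handles:

```go
package main

import "fmt"

// conv2d sketches the direct nested-loop convolution chosen over
// im2col + MatMul: single channel, stride 1, no padding, valid output.
func conv2d(in [][]float32, k [][]float32) [][]float32 {
	kh, kw := len(k), len(k[0])
	oh, ow := len(in)-kh+1, len(in[0])-kw+1
	out := make([][]float32, oh)
	for y := 0; y < oh; y++ {
		out[y] = make([]float32, ow)
		for x := 0; x < ow; x++ {
			var sum float32
			for ky := 0; ky < kh; ky++ {
				for kx := 0; kx < kw; kx++ {
					sum += in[y+ky][x+kx] * k[ky][kx]
				}
			}
			out[y][x] = sum
		}
	}
	return out
}

func main() {
	in := [][]float32{{1, 2, 3}, {4, 5, 6}, {7, 8, 9}}
	box := [][]float32{{1, 1}, {1, 1}} // 2x2 summing kernel
	fmt.Println(conv2d(in, box))       // [[12 16] [24 28]]
}
```

No intermediate im2col matrix is ever allocated, which is the memory win the decision above describes.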

### MoE Design

MoEGate routes tokens to top-k experts via softmax + topK selection. Gate weight
passed as runtime Forward input (not from params) to match the ONNX/ZMF pattern.
MixtureOfExperts dispatches to selected experts and aggregates weighted outputs.
Expert loading from ZMF sub-graphs deferred (tech debt).
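
The softmax + topK routing step can be sketched as follows; `topKGate` is an illustrative stand-in for MoEGate's Forward, working on one token's expert logits:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// topKGate sketches MoEGate routing: softmax over expert logits, then
// keep the k highest-probability experts with their gating weights.
func topKGate(logits []float64, k int) (idx []int, w []float64) {
	// numerically stable softmax
	max := logits[0]
	for _, v := range logits {
		if v > max {
			max = v
		}
	}
	probs := make([]float64, len(logits))
	var sum float64
	for i, v := range logits {
		probs[i] = math.Exp(v - max)
		sum += probs[i]
	}
	for i := range probs {
		probs[i] /= sum
	}
	// top-k selection by descending probability
	order := make([]int, len(probs))
	for i := range order {
		order[i] = i
	}
	sort.Slice(order, func(a, b int) bool { return probs[order[a]] > probs[order[b]] })
	for _, i := range order[:k] {
		idx = append(idx, i)
		w = append(w, probs[i])
	}
	return idx, w
}

func main() {
	idx, w := topKGate([]float64{0.1, 2.0, 1.0, -1.0}, 2)
	fmt.Println(idx) // experts 1 and 2 selected
	fmt.Println(w[0] > w[1])
}
```

MixtureOfExperts then runs only the selected experts and sums their outputs scaled by these weights.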

### Operator Inventory

New operators implemented and registered:

| Operator | File | Category |
|----------|------|----------|
| Softmax | layers/activations/softmax.go | Activation |
| Sigmoid builder | layers/activations/registry.go | Activation |
| Erf | layers/activations/erf.go | Activation |
| LayerNormalization | layers/normalization/registry.go | Normalization |
| BatchNormalization | layers/normalization/batch_norm.go | Normalization |
| Slice | layers/core/slice.go | Core |
| Pad | layers/core/pad.go | Core |
| TopK | layers/core/topk.go | Core |
| Conv2d | layers/core/conv2d.go | Core |
| GlobalAveragePool | layers/core/global_avg_pool.go | Core |
| Resize | layers/core/resize.go | Core |
| MoEGate | layers/core/moe.go | Core |
| MixtureOfExperts | layers/core/moe.go | Core |
| Constant | layers/core/constant.go | Core |

### Multi-Repo Discipline

zonnx and zerfoo are separate repos. Pre-commit hooks reject multi-directory
commits. zonnx converter fixes committed in zonnx; layer/model changes in zerfoo.
ONNX-to-ZMF conversion in zonnx handles special cases (Slice/Pad/TopK input
promotion, Resize scales/sizes, MatMulNBits dequantization).

## Consequences

- Gemma 3 end-to-end import validated (forward pass + greedy decode).
- SigLIP vision encoder and Kimi-VL connector validated.
- 13 new operators added to registry; total 56+ layers.
- MatMulNBits dequantization at converter level (not runtime) for standard
MatMul path.
- Expert loading from sub-graphs is documented tech debt.
- All parity tests are env-var gated (skip gracefully without model files).

### Key Files

- `layers/core/moe.go` -- MoEGate, MixtureOfExperts
- `layers/core/conv2d.go` -- Conv2d (nested-loop)
- `layers/core/constant.go` -- Constant node
- `tests/parity/gemma3_test.go` -- Gemma 3 parity test
- `tests/parity/siglip_test.go` -- SigLIP/Kimi-VL parity tests
99 changes: 99 additions & 0 deletions docs/adr/004-embeddable-inference-library.md
@@ -0,0 +1,99 @@
# ADR-004: Embeddable Inference Library

**Status:** Accepted
**Phase:** 8
**Date:** 2026-03-02

## Context

Running inference on an imported model required extensive manual wiring:
downloading ONNX files, converting them with the zonnx CLI, writing Go code to
create the Engine/Graph, using a whitespace-only tokenizer, and calling Forward
in a manual loop (no KV cache, O(n^2)), with no sampling beyond argmax and no
streaming.

Phase 8 transforms zerfoo into an embeddable inference library:

```go
m, _ := inference.Load("google/gemma-3-4b-it")
resp, _ := m.Generate(ctx, "Explain quantum computing")
```

## Decision

### BPE Tokenizer

Pure Go, no CGo. Loads HuggingFace tokenizer.json format (vocab, merge rules,
pre-tokenizer config, normalizer, special tokens). BPE merge loop: split into
bytes, iteratively merge highest-priority adjacent pair. Byte-level BPE
pre-tokenization (GPT-2 style). SentencePiece .model files not supported
directly; most HuggingFace models ship tokenizer.json.
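
The merge loop can be sketched as follows. This is a simplified illustration over a toy merge table; the real tokenizer works on the byte-level alphabet and loads its ranks from tokenizer.json:

```go
package main

import "fmt"

// bpeMerge sketches the BPE merge loop: repeatedly find the adjacent
// pair with the highest-priority (lowest-rank) merge rule and join it,
// until no mergeable pair remains.
func bpeMerge(tokens []string, ranks map[[2]string]int) []string {
	for {
		best, bestRank := -1, 1<<30
		for i := 0; i+1 < len(tokens); i++ {
			if r, ok := ranks[[2]string{tokens[i], tokens[i+1]}]; ok && r < bestRank {
				best, bestRank = i, r
			}
		}
		if best < 0 {
			return tokens // no rule applies
		}
		merged := tokens[best] + tokens[best+1]
		tokens = append(tokens[:best], append([]string{merged}, tokens[best+2:]...)...)
	}
}

func main() {
	// toy merge table learned at training time; rank 0 applies first
	ranks := map[[2]string]int{
		{"l", "o"}:  0,
		{"lo", "w"}: 1,
	}
	fmt.Println(bpeMerge([]string{"l", "o", "w"}, ranks)) // [low]
}
```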

### KV Cache

GenerationContext embeds context.Context and carries *KVCache. KVCache stores
per-layer key/value tensors (appended on each step). Attention layers
(GroupQueryAttention, GlobalAttention) check for KVCache in context: if present,
append current K/V, use full cached K/V for computation. Graph.Forward()
signature unchanged (opt-in via context). Callers without KVCache get existing
full-recompute behavior.
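
The append-and-return-all pattern can be sketched with a reduced cache; the real KVCache stores tensors rather than float slices, but the per-layer append semantics are the same:

```go
package main

import "fmt"

// KVCache sketches the per-layer cache: each decode step appends one
// position's K/V row, so attention reuses all prior positions instead
// of recomputing them.
type KVCache struct {
	keys   map[int][][]float32 // layer index -> cached key rows
	values map[int][][]float32
}

func NewKVCache() *KVCache {
	return &KVCache{keys: map[int][][]float32{}, values: map[int][][]float32{}}
}

// Append adds the current step's K/V row for one layer and returns the
// full cached sequence for the attention computation.
func (c *KVCache) Append(layer int, k, v []float32) (ks, vs [][]float32) {
	c.keys[layer] = append(c.keys[layer], k)
	c.values[layer] = append(c.values[layer], v)
	return c.keys[layer], c.values[layer]
}

func main() {
	cache := NewKVCache()
	cache.Append(0, []float32{1}, []float32{10})
	ks, _ := cache.Append(0, []float32{2}, []float32{20})
	fmt.Println(len(ks)) // 2 cached positions after two steps
}
```

Because the cache rides on the context, a caller that never installs one gets the unchanged full-recompute path.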

### Generation Loop

Generator holds graph, tokenizer, engine, model config. Autoregressive loop:
1. Encode prompt to token IDs
2. Forward pass for logits [1, seqLen, vocabSize]
3. Extract last-position logits
4. Apply: temperature scaling -> top-k filtering -> top-p filtering ->
repetition penalty -> sample (or argmax at temperature=0)
5. Check stop conditions (EOS, max tokens, stop strings)
6. Repeat with new token as input (KV cache handles prefix)

### Streaming

TokenStream interface with OnToken(token string, done bool) error. GenerateStream
delivers each decoded token as generated. Sentinel-based stop-string detection
with delta emission.

### Model Registry

Local cache under ~/.zerfoo/models/. Layout: <org>/<model_name>/ containing
model.zmf, tokenizer.json, config.json. Pull: download from HuggingFace Hub API,
convert ONNX to ZMF (zonnx as Go library), cache locally. HF_TOKEN env var for
gated models.

### HTTP Serve

net/http server. OpenAI-compatible endpoints:
- POST /v1/chat/completions (non-streaming + SSE)
- POST /v1/completions (non-streaming + SSE)
- GET /v1/models

### Constraints

- Pure Go, no CGo, no external C libraries for tokenization.
- KV cache is opt-in; does not break existing callers.
- Model registry works offline after initial pull.
- No training through high-level API.
- No multi-model serving.

## Consequences

- 3-line model loading and generation for end users.
- O(n) per generation step with KV cache (vs O(n^2) without).
- CLI commands (pull/run/serve) for interactive and server use.
- OpenAI API compatibility enables tool interoperability.
- Coverage: generate 95%, inference 96.4%, serve 96.4%.
- Embeddings not yet supported (Embed returns error).

### Key Files

- `pkg/tokenizer/bpe.go` -- BPE tokenizer
- `pkg/tokenizer/loader.go` -- tokenizer.json loader
- `generate/kvcache.go` -- KV cache
- `generate/generator.go` -- Autoregressive generation loop
- `generate/sampling.go` -- Temperature, top-k, top-p, repetition penalty
- `generate/stream.go` -- TokenStream interface
- `registry/registry.go` -- ModelRegistry, LocalRegistry
- `inference/inference.go` -- Load, Generate, Chat, GenerateStream
- `serve/server.go` -- OpenAI-compatible HTTP server
- `cmd/cli/{pull,run,serve}.go` -- CLI commands