docs: extract stable knowledge into ADRs and trim plan.md #2
Merged
Conversation
Add parseLlamaConfig for model_type "llama" mapping HuggingFace field names to ModelMetadata. Default rope_theta is 500000 (Llama 3 default). Register in DefaultArchConfigRegistry. Add testdata/llama3_config.json fixture based on Llama 3.1 8B config. Table-driven tests + fixture-based test verify all field mappings including RoPE scaling. Implements T57.2 (S57.2.1 through S57.2.4). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add parseMistralConfig for model_type "mistral" (default rope_theta 10000, adds sliding_window). Add parseQwenConfig for model_type "qwen2" (attention_bias always true, default rope_theta 1000000, YaRN rope scaling support). Register both in DefaultArchConfigRegistry. Add testdata fixtures: mistral7b_config.json, qwen25_7b_config.json. Table-driven + fixture-based tests verify all field mappings. Implements T57.3 (S57.3.1 through S57.3.4).
Add parsePhiConfig for model_type "phi3"/"phi" with partial_rotary_factor (default 1.0) and tie_word_embeddings. Add parseDeepSeekConfig for model_type "deepseek_v3" with MLA fields (kv_lora_rank, q_lora_rank, qk_rope_head_dim) and MoE fields (n_routed_experts, num_experts_per_tok, n_shared_experts). Register all in DefaultArchConfigRegistry. Extend ModelMetadata with 7 new fields: PartialRotaryFactor, KVLoRADim, QLoRADim, QKRopeHeadDim, NumExperts, NumExpertsPerToken, NumSharedExperts. Add testdata fixtures: phi4_config.json, deepseek_v3_config.json. Table-driven + fixture-based tests verify all field mappings. Implements T57.4 (S57.4.1 through S57.4.4).
Update loadMetadata to unmarshal config.json into raw map, then dispatch to DefaultArchConfigRegistry for architecture-specific field mapping. Overlays chat_template from raw JSON (internal field not in HuggingFace configs). Existing Gemma loading continues to work. Llama, Mistral, Qwen, Phi, and DeepSeek configs now parsed correctly via their registered parsers. Unknown model_type falls back to generic parsing. Implements T57.5 (S57.5.1 through S57.5.3).
ParamResolver interface maps architecture-specific weight names to canonical names used by Zerfoo layers. Phi renames dense_proj to o_proj; all other families (Llama, Gemma, Mistral, Qwen, DeepSeek) use canonical naming. Generic ResolveAll helper creates alias maps. 100% coverage on param_resolver.go.
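A sketch of the interface and helper described above, assuming illustrative method and type names (only the Phi `dense_proj` → `o_proj` rename comes from the commit text):

```go
package main

import "fmt"

// ParamResolver maps architecture-specific weight names to the canonical
// names layers expect. Interface shape is illustrative.
type ParamResolver interface {
	Resolve(name string) string
}

// phiResolver renames Phi's dense_proj to the canonical o_proj; all other
// names pass through unchanged.
type phiResolver struct{}

func (phiResolver) Resolve(name string) string {
	if name == "dense_proj" {
		return "o_proj"
	}
	return name
}

// ResolveAll builds an alias map from original to canonical names,
// omitting names that already resolve to themselves.
func ResolveAll(r ParamResolver, names []string) map[string]string {
	aliases := make(map[string]string)
	for _, n := range names {
		if c := r.Resolve(n); c != n {
			aliases[n] = c
		}
	}
	return aliases
}

func main() {
	fmt.Println(ResolveAll(phiResolver{}, []string{"dense_proj", "q_proj"})) // map[dense_proj:o_proj]
}
```

Families that already use canonical naming (Llama, Gemma, Mistral, Qwen, DeepSeek) would simply return every name unchanged, producing an empty alias map.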
Add variadic BuildOption with WithParamResolver to BuildFromZMF. When supplied, the resolver adds canonical parameter name aliases via ResolveAll so layer builders can look up parameters by canonical name even when the ZMF file uses architecture-specific names. Backward compatible: no options = no resolver = existing behavior.
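The variadic-option pattern can be sketched as follows; `BuildFromZMF`'s real signature and internals differ, so the types and the simplified parameter map below are assumptions for illustration:

```go
package main

import "fmt"

// buildConfig collects options; a nil resolver means existing behavior.
type buildConfig struct {
	resolver func(string) string
}

// BuildOption is the variadic functional option consumed by BuildFromZMF.
type BuildOption func(*buildConfig)

// WithParamResolver supplies a resolver used to add canonical-name aliases.
func WithParamResolver(r func(string) string) BuildOption {
	return func(c *buildConfig) { c.resolver = r }
}

// BuildFromZMF applies options, then aliases parameters when a resolver is
// present. No options means no resolver and no aliasing (backward compatible).
func BuildFromZMF(params map[string][]float32, opts ...BuildOption) map[string][]float32 {
	var cfg buildConfig
	for _, opt := range opts {
		opt(&cfg)
	}
	if cfg.resolver != nil {
		for name, w := range params {
			if canon := cfg.resolver(name); canon != name {
				params[canon] = w // alias, not copy: both names share the slice
			}
		}
	}
	return params
}

func main() {
	params := map[string][]float32{"dense_proj": {1, 2}}
	out := BuildFromZMF(params, WithParamResolver(func(n string) string {
		if n == "dense_proj" {
			return "o_proj"
		}
		return n
	}))
	fmt.Println(len(out["o_proj"])) // 2
}
```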
Move dirRegistry, forward pass, greedy decode, and generation test patterns into helpers_test.go. Simplify Gemma 3 test files to use the shared helpers, eliminating code duplication.
… tests. Env-gated by LLAMA3_ZMF_PATH / LLAMA3_MODEL_DIR; tests skip gracefully when model files are absent. Covers forward-pass shape validation, 5-step greedy decode, and inference API generation (greedy determinism, stream parity, chat).
Extract modelParityConfig, runModelForwardPass, runModelGreedyDecode, runModelGeneration helpers to eliminate structural duplication across model test files. Migrate Gemma 3 and Llama 3 tests to the new pattern, and merge gemma3_generation_test.go into gemma3_test.go.
… tests. Three env-gated parity tests for Mistral models using the shared modelParityConfig pattern. Skips gracefully when MISTRAL_ZMF_PATH and MISTRAL_MODEL_DIR are not set.
The buildGroupedQueryAttention registry builder now looks up optional bias parameters (name_wq_bias, name_wk_bias, name_wv_bias, name_wo_bias) and passes them to NewDenseFromParams. When absent, behavior is unchanged (backward compatible). This enables Qwen models which use attention_bias=true.
WithYaRNScaling(factor, origMaxLen) modifies inverse frequencies:
- Low frequency (wavelength > factor*origMaxLen): scaled by 1/factor
- Intermediate (origMaxLen <= wavelength <= factor*origMaxLen): interpolated
- High frequency (wavelength < origMaxLen): unchanged

AttentionScaleFactor() returns sqrt(1 + ln(factor)/ln(origMaxLen)). Without YaRN, behavior is unchanged (backward compatible).
The buildGroupedQueryAttention function now reads optional rope_scaling_type, rope_scaling_factor, and rope_scaling_orig_max_len attributes and passes WithYaRNScaling to the RoPE constructor.
WithGlobalAttributes injects extra attributes into every node during graph construction. The inference loader now converts ModelMetadata.RopeScaling into global attributes so GQA nodes receive YaRN parameters without the ZMF file needing to carry them.
Forward pass, greedy decode, and generation tests gated by QWEN25_ZMF_PATH and QWEN25_MODEL_DIR environment variables.
WithRotaryDimFraction controls what fraction of head dimensions receive rotation. Default is 1.0 (all dims rotated). When fraction < 1.0, the Forward/Backward methods split into rotated and pass-through portions. Phi-4 uses 0.75 for partial RoPE.
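The rotated/pass-through split can be sketched as a slicing helper; the function below is illustrative (the real Forward/Backward operate on tensors), assuming rotation applies to the leading dimensions in pairs:

```go
package main

import "fmt"

// splitRotary partitions a head vector into the rotated prefix and the
// pass-through suffix according to the rotary dim fraction. Fraction 1.0
// rotates every dimension (the default); Phi-4 uses 0.75.
func splitRotary(head []float32, fraction float64) (rotated, passthrough []float32) {
	rotDims := int(fraction * float64(len(head)))
	rotDims -= rotDims % 2 // RoPE rotates dimension pairs
	return head[:rotDims], head[rotDims:]
}

func main() {
	head := make([]float32, 128)
	rot, pass := splitRotary(head, 0.75)
	fmt.Println(len(rot), len(pass)) // 96 32
}
```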
The buildGroupedQueryAttention function now reads an optional partial_rotary_factor attribute and passes WithRotaryDimFraction to the RoPE constructor when fraction < 1.0.
When ModelMetadata.PartialRotaryFactor is set (0 < f < 1), inject it as a global attribute so GQA nodes receive partial RoPE configuration.
NewTiedLMHead creates an LMHead that reuses the token embedding weight matrix (transposed) instead of owning its own. The tied LMHead has no trainable parameters since the embedding layer owns the weight.
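Weight tying can be sketched as follows, assuming illustrative types (the real LMHead operates on the engine's tensor types): the head borrows the embedding matrix and multiplies against its transpose, so logits come from dot products with embedding rows and the head reports no parameters of its own.

```go
package main

import "fmt"

// TiedLMHead computes logits against the transposed embedding matrix,
// sharing the weight rather than owning it. emb is [vocab][hidden].
type TiedLMHead struct {
	emb [][]float32 // owned by the embedding layer, borrowed here
}

func NewTiedLMHead(embedding [][]float32) *TiedLMHead {
	return &TiedLMHead{emb: embedding}
}

// Forward returns hidden · embᵀ: since emb stores one row per token,
// the transpose reduces to a dot product against each embedding row.
func (h *TiedLMHead) Forward(hidden []float32) []float32 {
	logits := make([]float32, len(h.emb))
	for v, row := range h.emb {
		var dot float32
		for i, x := range hidden {
			dot += x * row[i]
		}
		logits[v] = dot
	}
	return logits
}

// Params returns nil: the embedding layer owns the weight, so the tied
// head has no trainable parameters.
func (h *TiedLMHead) Params() [][]float32 { return nil }

func main() {
	emb := [][]float32{{1, 0}, {0, 1}}
	fmt.Println(NewTiedLMHead(emb).Forward([]float32{3, 4})) // [3 4]
}
```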
Forward pass, greedy decode, and generation tests gated by PHI4_ZMF_PATH and PHI4_MODEL_DIR environment variables.
Implements MLA as used in DeepSeek V3/R1. Compresses KV into a low-rank latent vector via down-projection, then up-projects to K and V. Includes RoPE integration and SDPA.
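The KV compression path can be sketched as two projections; the weight names follow the commit (W_DKV, W_UK, W_UV), but the vector shapes and helper below are assumptions, and RoPE integration and SDPA are omitted:

```go
package main

import "fmt"

// matVec computes w·x for w of shape [rows][cols].
func matVec(w [][]float32, x []float32) []float32 {
	out := make([]float32, len(w))
	for r, row := range w {
		for c, v := range row {
			out[r] += v * x[c]
		}
	}
	return out
}

// mlaKV sketches the KV path of Multi-head Latent Attention: compress the
// hidden state to a kv_lora_dim latent via the down-projection wDKV, then
// up-project the latent to full K and V via wUK and wUV.
func mlaKV(hidden []float32, wDKV, wUK, wUV [][]float32) (k, v []float32) {
	latent := matVec(wDKV, hidden) // [kv_lora_dim]
	return matVec(wUK, latent), matVec(wUV, latent)
}

func main() {
	hidden := []float32{1, 2, 3, 4}
	wDKV := [][]float32{{1, 0, 0, 0}, {0, 1, 0, 0}} // hidden 4 -> latent 2
	wUK := [][]float32{{1, 0}, {0, 1}, {1, 1}}      // latent 2 -> head_dim 3
	wUV := [][]float32{{2, 0}, {0, 2}, {0, 0}}
	k, v := mlaKV(hidden, wDKV, wUK, wUV)
	fmt.Println(k, v) // [1 2 3] [2 4 0]
}
```

The cache-size win of MLA comes from storing only the small latent per position instead of full K and V.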
Reads num_heads, head_dim, kv_lora_dim, max_seq_len from attributes and loads W_Q, W_DKV, W_UK, W_UV, W_O from node parameters. Includes tests for missing attributes, missing params, and custom rope_base.
When SharedExpert is non-nil, its output is added to the weighted routed expert sum for every token. When nil, behavior is unchanged (backward compatible).
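The shared-expert addition can be sketched as below, with illustrative types (the real layer operates on batched tensors with a router): the weighted routed sum is computed first, and the shared expert's output, when present, is added unconditionally.

```go
package main

import "fmt"

// Expert is any feed-forward block; signature illustrative.
type Expert func(x []float32) []float32

// moeForward sums weighted routed-expert outputs and, when sharedExpert is
// non-nil, adds its output for every token; with nil sharedExpert the
// routed sum alone is returned (backward compatible).
func moeForward(x []float32, routed []Expert, weights []float32, sharedExpert Expert) []float32 {
	out := make([]float32, len(x))
	for i, e := range routed {
		for d, v := range e(x) {
			out[d] += weights[i] * v
		}
	}
	if sharedExpert != nil {
		for d, v := range sharedExpert(x) {
			out[d] += v // shared expert applies to every token, unweighted
		}
	}
	return out
}

func main() {
	double := Expert(func(x []float32) []float32 { return []float32{2 * x[0]} })
	identity := Expert(func(x []float32) []float32 { return []float32{x[0]} })
	// routed contribution 0.5*6 = 3, plus shared contribution 3.
	fmt.Println(moeForward([]float32{3}, []Expert{double}, []float32{0.5}, identity)) // [6]
}
```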
Env-gated by DEEPSEEK_ZMF_PATH / DEEPSEEK_MODEL_DIR. Includes forward pass, greedy decode, and generation tests.
Update docs/plan.md to mark E69 (final verification) complete. Add Section 12 to docs/design.md documenting supported model families, architecture-specific features, and config registry.
Trim plan.md from 3058 to 272 lines by extracting stable design decisions into 6 ADR files and replacing ~2500 lines of detailed task breakdowns with phase summaries and ADR cross-references.

New files:
- docs/adr/001-enterprise-production-readiness.md (Phases 4+7)
- docs/adr/002-distributed-training-protocol.md (Phase 5)
- docs/adr/003-open-weights-model-import.md (Phase 6)
- docs/adr/004-embeddable-inference-library.md (Phase 8)
- docs/adr/005-multi-architecture-support.md (Phase 9)
- docs/adr/006-gpu-engine-architecture.md (Phases 2-3)

Updates to design.md (+25 lines):
- Section 7.2: added 6 model families' parity test env vars
- Section 12.4: added parameter name resolver note
- Section 13: added ADR index table

Preserved in plan.md: blocked item E29 (GPU validation), operating procedure, hand-off notes, final scorecard, and new packages table.
Summary
- Trimmed docs/plan.md from 3058 to 272 lines (91% reduction) by extracting stable design decisions into 6 ADR files
- Updated docs/design.md (+25 lines)
- New ADR files under docs/adr/

Test plan