First quantitative validation of channel causality in a from-scratch channel-causal architecture. (Channel state demonstrably changes generation — see §"Scope of validation" below for what this does and does NOT claim.)
A 145M-parameter decoder with six proprioceptive channels (memory, affect, time, ethics, identity, continuity), integrated into every layer via cross-attention with ReZero gating.
Built between 2026-05-02 and 2026-05-04 by Mario Gutierrez at Celiums Solutions LLC.
This repository contains the code, methodology, evaluation suite, training/eval logs, and per-iteration metrics for the tinyMARS architectural validation experiment.
The strong claim of MARS (Multi-channel Architecture for Real-time Subjectivity) is that the architecture is a new category, not a fine-tuning trick. The parallel MARS-Real track validated MARS as an adapter on top of Google's frozen Gemma 4 E2B-it. tinyMARS asks the harder question: if we train the channels into the model from layer one, with no pretrained backbone underneath, do they actually become causal in generation?
After four iterations:
- The affect channel rose from 40% pass rate (iter 3, skills+papers corpus) to 80% pass rate (iter 4, after adding ~210K rows of affect-rich dialog), on a 20-test ablation suite where flipping the channel state should change the response.
- `alpha_l2` (the L2 norm of the ReZero alpha gates, our proxy for "are channels actually contributing") sustained at 0.323 at the end of iter 4 training.
- 7 of 8 measured zero-distance ablation tests in iter 4 had `cosine_distance(response_with_channels, response_with_zeroed_channels) > 0.04`, the threshold above which channels are demonstrably modifying output.
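The three headline metrics reduce to small formulas. Below is a minimal sketch of them, assuming each ReZero gate is a single scalar and that the two responses are compared through some sentence-embedding model (which embedder the eval actually uses is not stated here, so the vectors below are toy placeholders). The function names are hypothetical and are not the API of `eval/eval_tinymars.py`.

```python
import math
from typing import Sequence

PASS_THRESHOLD = 0.04  # cosine distance above which channels count as modifying output


def alpha_l2(alphas: Sequence[float]) -> float:
    """L2 norm over every ReZero alpha gate in the model (one scalar per gate, assumed)."""
    return math.sqrt(sum(a * a for a in alphas))


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity between two response embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def channels_modify_output(emb_with: Sequence[float], emb_zeroed: Sequence[float]) -> bool:
    """Ablation check: does zeroing the channels move the response past the threshold?"""
    return cosine_distance(emb_with, emb_zeroed) > PASS_THRESHOLD


def pass_rate(results: Sequence[bool]) -> float:
    """Fraction of ablation tests that passed."""
    return sum(results) / len(results)


# Toy embeddings standing in for embedded generations.
print(channels_modify_output([1.0, 0.0], [0.9, 0.1]))  # False: distance is about 0.006
print(channels_modify_output([1.0, 0.0], [0.6, 0.8]))  # True: distance is about 0.4
print(pass_rate([True] * 16 + [False] * 4))            # 0.8, i.e. 16 of 20 tests
```

On the 20-test suite, the move from 40% to 80% corresponds to going from 8 to 16 passing tests.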
Scope of validation. These results validate channel causality: channel state changes generation in measurable, judge-recognized ways. They do not validate the stronger architectural delta — Native (with channels) vs Baseline (no channels) trained on the same corpus at the same scale. Neither iter 3 nor iter 4 ran a parallel baseline at their respective corpus scales (decision: NO_BASELINE in both iteration JSONs), so the "+25pp delta" criterion for full architectural validation is not met. See docs/english/00-executive-summary.md §Limitations and docs/english/08-conclusions.md for the full caveats.
For the full paper-grade analysis with all numbers, see docs/closure/final-report.md or the rendered PDF docs/closure/tinyMARS-final-report.pdf.
For the conversational story (NotebookLM-friendly), see docs/narrative/.
- Not a deployable model. Eval loss 1.47 with 145M params is below conversational competence. We are publishing the architectural prototype, not weights, because the model is below the threshold where releasing weights would produce a useful product. Plan B (the parallel MARS-Real track on Gemma 4 E2B-it) is the deployable artifact and will be released separately.
- Not a replacement for transformers. TinyMARSDecoder is a vanilla decoder with one structural addition: a `ChannelInjection` block per channel inserted at every layer.
- Not a tokenizer innovation. Standard 32K BPE.
- Not an attention-mechanism innovation. Standard self-attention with RoPE.
- Not the full Celiums system. Hyphae (the structured external substrate) is a separate research/product track and is not part of this OSS work.
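To make the one structural addition concrete, here is a minimal pure-Python sketch of what a single channel-injection block could look like: single-head cross-attention in which the decoder's hidden states query one channel's state vectors, with the result added back through a ReZero gate initialized to zero. It is an illustration under stated assumptions (single head, no normalization, channel states already at model width, hand-supplied projection matrices); `ChannelInjectionSketch`, `matvec`, and `softmax` are hypothetical names, not the code in `src/tinymars_model.py`.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Multiply a row-major matrix by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class ChannelInjectionSketch:
    """Hypothetical single-head cross-attention from token states to one channel's
    state vectors, added back through a ReZero gate (alpha starts at 0)."""

    def __init__(self, wq: Matrix, wk: Matrix, wv: Matrix, alpha: float = 0.0):
        self.wq, self.wk, self.wv = wq, wk, wv
        self.alpha = alpha  # learned scalar in a real model; 0 makes the block an identity

    def __call__(self, hidden: List[Vector], channel_state: List[Vector]) -> List[Vector]:
        keys = [matvec(self.wk, c) for c in channel_state]
        values = [matvec(self.wv, c) for c in channel_state]
        scale = math.sqrt(len(keys[0]))
        out = []
        for h in hidden:
            q = matvec(self.wq, h)
            # Attention weights of this token over the channel's state vectors.
            weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
            attended = [sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))]
            # ReZero residual: h + alpha * CrossAttn(h, channel_state).
            out.append([hi + self.alpha * ai for hi, ai in zip(h, attended)])
        return out


identity = [[1.0, 0.0], [0.0, 1.0]]
hidden = [[1.0, 0.0], [0.5, -0.5]]
channel_state = [[0.0, 1.0], [1.0, 1.0]]

at_init = ChannelInjectionSketch(identity, identity, identity)
print(at_init(hidden, channel_state) == hidden)  # True: alpha = 0 passes hidden states through
```

Because alpha starts at 0, each block is an exact identity at initialization, and a channel only influences generation once training moves its gate away from zero. That is why the L2 norm of the alpha gates works as a proxy for channel contribution.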
```
.
├── README.md                             this file
├── LICENSE                               MIT
├── CHANGELOG.md
├── .gitignore
├── requirements.txt                      deps for training + eval
├── src/
│   └── tinymars_model.py                 TinyMARSDecoder + ChannelInjection + ReZero
├── training/
│   ├── train_tinymars.py                 training loop
│   ├── build_pretrain_corpus.py          corpus builder
│   ├── tokenizer_train.py                custom 32K BPE
│   ├── parquet_corpus.py                 corpus utilities
│   └── random_corpus.py                  synthetic corpus for smoke tests
├── eval/
│   ├── eval_tinymars.py                  eval runner
│   ├── eval_suite.json                   the 20 ablation tests
│   ├── channels.py                       channel state class
│   ├── checkpoint_model.py               model loader
│   ├── judges.py                         LLM-as-judge methodology
│   ├── stub_model.py                     testing infrastructure
│   ├── tests.py                          test definitions
│   └── chat_tinymars.py                  interactive REPL against a checkpoint
├── pipeline/
│   ├── gen_tinymars_corpus_{A,B,C,D}.py  corpus generators (one per Tipo)
│   ├── expand_base_prompts.py
│   ├── expand_memory_narratives.py
│   ├── inference_client.py               LLM API wrapper (BYOK via env vars)
│   ├── personas.json                     12 synthetic personas for Tipo C
│   ├── affect_quadrants.py               PAD quadrant definitions
│   ├── topic_pool.json                   diverse topics
│   ├── base_prompts_A.json               seed prompts for affect-controlled pairs
│   ├── substrate_prompts.json            seed prompts for substrate-aware pairs
│   └── memory_narratives.json            synthetic episodic memories
├── scripts/
│   ├── chat_iter3.sh                     chat against a remote checkpoint
│   ├── watch.sh, watch_iter3.sh          live training dashboard
│   └── render_final_pdf.py               markdown → PDF
├── docs/
│   ├── 00-mision.md → 05-roadmap.md      project design (Spanish)
│   ├── english/00-08-*.md                paper-grade English docs
│   ├── narrative/00-05-*.md              conversational narrative (NotebookLM-friendly)
│   ├── iterations/iter[1-4].json         raw metrics per iteration
│   ├── closure/
│   │   ├── final-report.md               paper-grade summary
│   │   └── tinyMARS-final-report.pdf     rendered PDF
│   └── procedures/PROCEDURES.md          reproducibility procedures
└── artifacts/
    ├── iter[1-4]-summary.json            final metrics per iteration
    ├── iter[3,4]-eval.json               full per-test eval results
    └── iter[3,4]_native.log              training logs (audit trail)
```
Reproducing tinyMARS requires:
- Hardware: an H200 SXM 141 GB or equivalent. Iter 3 trained 7,687 steps in 6 hours. Iter 4 trained 19,073 steps in 15 hours.
- Base model: none. tinyMARS trains from random initialization with a custom 32K-BPE tokenizer.
- Corpus: skills + papers as replay (we used internal Celiums corpora, not in this repo), plus channel-causal pairs generated via `pipeline/gen_tinymars_corpus_{A,B,C,D}.py` using a teacher LLM (we used Anthropic Opus 4.7 via DO Inference).
- API access: a teacher LLM (Anthropic Opus or equivalent) and a judge LLM (Sonnet or equivalent). API keys are read from environment variables; see `pipeline/inference_client.py`.
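For a sense of scale, the training flags in the recipe (`--bs 8 --grad-accum 16 --seq-len 2048`, `--max-steps 19073`, which matches iter 4's step count) imply the token budget below. This is a back-of-the-envelope sketch that assumes `--bs` counts sequences per micro-batch on a single GPU, that `--grad-accum` multiplies it into one optimizer step, and that every packed row is a full 2,048 tokens; none of that is confirmed here, so treat the result as an upper bound.

```python
# Assumptions (not confirmed by train_tinymars.py): --bs is sequences per micro-batch
# on one GPU, --grad-accum micro-batches form one optimizer step, rows are fully packed.
MICRO_BATCH = 8        # --bs
GRAD_ACCUM = 16        # --grad-accum
SEQ_LEN = 2048         # --seq-len
MAX_STEPS = 19_073     # --max-steps (iter 4's step count)
WALL_CLOCK_HOURS = 15  # iter 4 wall clock, approximate

tokens_per_step = MICRO_BATCH * GRAD_ACCUM * SEQ_LEN
total_tokens = tokens_per_step * MAX_STEPS
tokens_per_second = total_tokens / (WALL_CLOCK_HOURS * 3600)

print(f"tokens per optimizer step: {tokens_per_step:,}")
print(f"tokens over the run:       {total_tokens:,} (~{total_tokens / 1e9:.1f}B)")
print(f"implied throughput:        ~{tokens_per_second:,.0f} tokens/s")
```

Under those assumptions iter 4 saw about 262K tokens per optimizer step and roughly 5.0B tokens overall, which over the ~15-hour run works out to roughly 93K tokens/s on the H200.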
```bash
# 1. Install
pip install -r requirements.txt

# 2. Set API keys
export DO_INFERENCE_API_KEY="<your-do-inference-key>"
# or set the equivalent env var for your provider

# 3. Generate corpus (Tipo A example)
python pipeline/gen_tinymars_corpus_A.py --out data/A/ --target 3000

# 4. Train tokenizer
python training/tokenizer_train.py --corpus data/ --out data/tokenizer/tokenizer-32k.model

# 5. Build packed pretraining corpus
python training/build_pretrain_corpus.py --out data/corpus/packed/

# 6. Train
python -m training.train_tinymars \
  --variant native \
  --max-steps 19073 \
  --bs 8 --grad-accum 16 --seq-len 2048 \
  --lr 6e-4 --warmup-ratio 0.02 --weight-decay 0.10 \
  --channel-dropout-p 0.5 --channel-dropout-warmup-steps 100 \
  --train-data "data/corpus/packed/train-*.parquet" \
  --eval-data "data/corpus/packed/eval-*.parquet" \
  --out checkpoints/iter5-native

# 7. Evaluate
python -m eval.eval_tinymars \
  --ckpt checkpoints/iter5-native/latest.pt \
  --suite eval/eval_suite.json
```

- Model weights. tinyMARS is below conversational threshold; releasing weights would produce confusion ("Celiums released a model that doesn't work well"). Plan B (4.6B refined Gemma 4 E2B base) is the deployable artifact, in a separate forthcoming repo.
- Generated corpus. The channel-causal pairs (Tipos A/B/C/D) were generated using teacher LLMs whose terms of service may restrict redistribution of generated text. We publish the methodology and prompts so anyone can regenerate their own corpus.
- Personal data. Some episodic memory pairs in Tipo B were patterned on actual journal entries. Not redistributed.
- Internal Celiums corpus (skills + papers replay). This comes from the parallel Celiums knowledge-engine track and is under separate licensing.
- Hyphae internals. Hyphae is a separate research/product track. Conceptual references in some docs; no implementation in this repo.
MIT for code, evaluation, and documentation. See LICENSE.
If you use this work, please cite:
```bibtex
@misc{tinymars2026,
  author       = {Mario Gutierrez},
  title        = {tinyMARS: First quantitative validation of channel causality in a from-scratch channel-causal architecture},
  year         = {2026},
  month        = {May},
  howpublished = {\url{https://github.com/terrizoaguimor/tinymars}},
  note         = {Celiums Solutions LLC, paper-grade research artifact}
}
```

- Website: https://celiums.ai/
- Research post: https://celiums.ai/research/tinymars
- Twitter/X: @CeliumsKnowledge
- Issues: please open a GitHub issue for technical questions
Anthropic Claude Opus 4.7 and Claude Sonnet 4.6, served via DigitalOcean Gradient AI Inference, were used for channel-causal pair generation, corpus quality scoring, judge filtering of generated channel-causal pairs, and rewrite tasks during identity-defense corpus refinement. Use is governed by the DigitalOcean Gradient AI Inference terms of service.
Custom 32K BPE tokenizer trained from scratch on the tinyMARS corpus using SentencePiece. tinyMARS does not derive from any pretrained tokenizer.
MARS-Real (separate track, not part of this OSS release): a continued pretraining derivative of Google's Gemma 4 E2B-it, used in compliance with the Gemma Terms of Use. MARS-Real is the deployable artifact; tinyMARS (this work) is the architectural validation prototype trained from random initialization.
Single H200 SXM 141 GB cloud instance via DigitalOcean Cloud GPU Droplets plus teacher/judge LLM calls via DigitalOcean Gradient AI Inference.
Direct cost breakdown (~$1,110 total):
| Item | Cost | Notes |
|---|---|---|
| Teacher / judge LLM (Opus 4.7 + Sonnet 4.6) | ~$945 | ~33,000 generation+judge cycles across all four iterations of the channel-causal corpus |
| H200 SXM 141 GB droplet | ~$160 | 2 days × $3.34/h — all training iterations + corpus building |
| DOKS compute-burst | ~$5 | parallel corpus generation |
| DO Spaces (S3-compatible) | <$1 | shard distribution |
Important note on cost shape: the dominant cost is inference, not compute. Generating channel-causal training pairs with a frontier teacher LLM (Anthropic Claude Opus 4.7) costs more than running the H200 itself for the same window. This is a useful number to be honest about: tinyMARS-class architectural validation experiments are inference-bound when the corpus is teacher-generated. Future iterations could reduce this materially by using smaller teacher models for non-critical pair categories or by replacing teacher-generated content with curated public datasets where the channel signal is naturally present.
Not included: Mario's time; pre-existing Celiums infrastructure (memory engine, knowledge engine, OpenSearch, Postgres + pgvector, MCP tools) developed over the prior nine months and not chargeable to this experiment; snapshot retention going forward (~$12/month). Estimates use observed API pricing as of May 2026 (Opus 4.7: $15/M input, $75/M output; Sonnet 4.6: $3/M input, $15/M output) and per-pair token averages from production runs; final billing reconciliation pending.
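The cost figures are simple enough to check by hand; the sketch below recomputes them from the numbers in the table and the pricing note. The only inputs not taken from the source are the hypothetical token counts in the last example (per-pair token averages are not published here) and treating the "<$1" Spaces line as exactly $1.

```python
# Prices and line items copied from the cost table and pricing note above.
OPUS_PRICE = {"input": 15.0, "output": 75.0}   # USD per 1M tokens (Opus 4.7, May 2026)
SONNET_PRICE = {"input": 3.0, "output": 15.0}  # USD per 1M tokens (Sonnet 4.6, May 2026)


def llm_call_cost(input_tokens: int, output_tokens: int, price: dict) -> float:
    """Cost of one LLM call in USD under per-million-token pricing."""
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def gpu_cost(days: float, usd_per_hour: float) -> float:
    """Cost of a GPU instance billed hourly."""
    return days * 24 * usd_per_hour


line_items = {
    "teacher/judge LLM": 945.0,
    "H200 droplet": gpu_cost(2, 3.34),  # 48 h x $3.34/h
    "DOKS burst": 5.0,
    "DO Spaces": 1.0,                   # upper bound of "<$1"
}
total = sum(line_items.values())
inference_share = line_items["teacher/judge LLM"] / total
cost_per_cycle = line_items["teacher/judge LLM"] / 33_000

for name, cost in line_items.items():
    print(f"{name:<18} ${cost:,.2f}")
print(f"total ~${total:,.2f}; inference share {inference_share:.0%}; ~${cost_per_cycle:.3f} per cycle")

# Hypothetical Opus call with 2,000 input and 1,000 output tokens.
example_call = llm_call_cost(2_000, 1_000, OPUS_PRICE)
print(f"hypothetical Opus call: ${example_call:.3f}")
```

This reproduces the ~$1,110 total (about $1,111 with the Spaces line rounded up) and shows teacher/judge inference at roughly 85% of direct spend, about $0.029 per generation+judge cycle. Note that at these prices an Opus output token costs five times an input token, so output length drives most of the teacher bill.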
- ReZero — Bachlechner et al. (2020) — alpha-gate identity-init pattern for residual connections.
- ACT-R framework — Anderson et al. — procedural memory architecture and sub-symbolic activation.
- BGE-M3 — BAAI — multilingual 1024-d embeddings used for memory / identity / continuity channels.
- RoPE — Su et al. (2021) — rotary positional embeddings.
Built between 2026-05-02 and 2026-05-04. Total direct cost: ~$1,110, dominated by teacher LLM inference for corpus generation. Honest reporting of what we proved and what we didn't.