Empty Neural Networks (ENN) is an architectural framework demonstrating that under extreme parameter constraints (e.g., the 16MB limit of the OpenAI Parameter Golf challenge), language models should transition from being storage-bound factual databases to state-bound dynamic routing systems.
This repository implements a Hybrid Sub-Quadratic Architecture (V5) prioritizing temporal state continuity, extreme VRAM efficiency, and cognitive state swapping for Agentic OS environments.
Traditional scaling laws rely on Transformer MLPs acting as key-value stores for factual knowledge. However, when artificially constrained to approximately 15.5M - 35M parameters to fit within a 16MB serialized artifact, this paradigm encounters a structural bottleneck.
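As a rough sanity check on that range, the arithmetic below shows how many parameters a 16 MiB artifact can hold at different bit-widths (plain arithmetic, no project-specific assumptions):

```python
# How many parameters fit in a 16 MiB serialized artifact at different precisions?
BUDGET_BYTES = 16 * 1024 * 1024            # 16 MiB hard limit

for name, bits in [("fp16/bf16", 16), ("int8", 8), ("int6", 6), ("int4", 4)]:
    params = BUDGET_BYTES * 8 // bits
    print(f"{name:>9}: ~{params / 1e6:.1f}M parameters")
# fp16/bf16: ~8.4M, int8: ~16.8M, int6: ~22.4M, int4: ~33.6M
```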
Current optimization strategies focus on incremental compression. H-ENN proposes a fundamental shift: sub-quadratic models with linear-time compute and constant memory requirements. By combining linear SSM backbones with sparse attention mechanisms, we redefine the inductive bias of the forward pass to support continuous 24/7 agentic workflows without KV-cache explosion.
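To make the "KV-cache explosion" point concrete, here is a back-of-envelope memory comparison for a long decode session; the context length, window size, state size, and layer split are illustrative assumptions, not measured H-ENN figures:

```python
# Back-of-envelope decode-memory comparison for a long-running agent session.
# All numbers are illustrative assumptions, not measured H-ENN figures.
n_layer, d_model, ctx, window, bytes_bf16 = 12, 768, 1_000_000, 1024, 2
n_attn, n_ssm, d_state = 3, 9, 64          # every 4th of 12 layers is sliding-window attention

full_attn_kv = ctx * 2 * d_model * bytes_bf16 * n_layer        # K and V per token, every layer
swa_kv       = window * 2 * d_model * bytes_bf16 * n_attn      # bounded by the window
ssm_state    = d_model * d_state * bytes_bf16 * n_ssm          # fixed-size recurrent state

print(f"full attention KV-cache: {full_attn_kv / 2**30:.1f} GiB")   # ~34.3 GiB and growing
print(f"sliding-window KV-cache: {swa_kv / 2**20:.1f} MiB")         # ~9.0 MiB, constant
print(f"SSM recurrent state:     {ssm_state / 2**20:.2f} MiB")      # ~0.84 MiB, constant
```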
H-ENN redefines the role of neural parameters. Instead of encoding a static dataset, the parameters act as a dynamic routing system: they govern how a compact, continuously updated state is written, maintained, and read at inference time.
To mitigate the retrieval downsides of fixed state sizes in pure linear models, H-ENN employs a hybrid architecture that interleaves linear layers with self-attention.
- The Subconscious (Mamba-3 / GDN Backbone): Uses a multi-input, multi-output (MIMO) formulation to improve model quality and hardware utilization without increasing decode latency, and features a complex-valued state-update rule that enables richer state tracking.
- The Precision Cache (Attention): Every 4th layer acts as a Sliding Window Attention block, providing an exact-match L1-cache for immediate context retrieval (see the sketch after this list).
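A minimal sketch of the interleave and of a complex-valued diagonal recurrence is shown below; `is_attention_layer` and `DiagonalComplexSSM` are illustrative names, and the update rule is a generic linear-SSM formulation rather than the exact Mamba-3 MIMO rule:

```python
import torch
import torch.nn as nn

ATTN_EVERY = 4  # every 4th layer is a sliding-window attention block (layers 4, 8, 12)

def is_attention_layer(layer_idx: int) -> bool:
    """True for the 'precision cache' layers; all other layers are SSM blocks."""
    return (layer_idx + 1) % ATTN_EVERY == 0

class DiagonalComplexSSM(nn.Module):
    """Generic diagonal linear recurrence with a complex-valued state:
        h_t = a * h_{t-1} + B x_t,   y_t = Re(C h_t),   |a| < 1
    """
    def __init__(self, d_model: int = 768, d_state: int = 64):
        super().__init__()
        self.log_mag = nn.Parameter(torch.full((d_state,), -0.5))   # |a| = exp(log_mag) < 1
        self.phase = nn.Parameter(torch.zeros(d_state))             # arg(a): oscillatory memory
        self.B = nn.Parameter(torch.randn(d_state, d_model) * 0.02)
        self.C = nn.Parameter(torch.randn(d_model, d_state) * 0.02)

    def forward(self, x, h=None):
        # x: (batch, seq, d_model); h: (batch, d_state) complex state carried across ticks
        a = torch.polar(torch.exp(self.log_mag), self.phase)        # complex decay per channel
        if h is None:
            h = x.new_zeros(x.size(0), self.B.size(0), dtype=torch.cfloat)
        ys = []
        for t in range(x.size(1)):                                  # recurrent (decode-style) scan
            h = a * h + (x[:, t] @ self.B.T).to(torch.cfloat)
            ys.append((h @ self.C.T.to(torch.cfloat)).real)
        return torch.stack(ys, dim=1), h
```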
Standard sequence-level training fails to stabilize deep state-space models and consumes excessive VRAM. H-ENN utilizes Tick-Based Truncated BPTT. The sequence is processed in discrete temporal "ticks", detaching the gradient graph at each step while maintaining the continuous flow of the compressed state.
This allows for large batch sizes (e.g., 32k+ tokens per step) with a VRAM footprint that stays effectively constant as sequence length grows, and enables agents to "sleep" by serializing their compact KV/Mamba states to an SSD.
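A minimal sketch of the "sleep" mechanic, assuming `states` is the per-layer list produced by the forward pass shown further below; the helper names and file path are hypothetical:

```python
import torch

def sleep(states, path="agent_state.pt"):
    # Move the compact per-layer states to CPU and write them to disk (e.g., an SSD).
    cpu_states = [
        tuple(v.cpu() for v in s) if isinstance(s, tuple)
        else (s.cpu() if s is not None else None)
        for s in states
    ]
    torch.save(cpu_states, path)

def wake(path="agent_state.pt", device="cuda"):
    # Restore the serialized states onto the target device and resume the session.
    return torch.load(path, map_location=device)
```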
To maximize the 10-minute training budget, the framework combines Stochastic Weight Averaging (SWA) for late-stage minimum smoothing with Data-Dependent Initialization (Algorithmic Presets) to bypass the unigram/bigram learning phases.
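A minimal sketch of late-stage SWA using PyTorch's built-in utilities; the toy model, step counts, and learning rates are placeholders rather than the actual H-ENN training configuration:

```python
import torch
import torch.nn as nn
from torch.optim.swa_utils import AveragedModel, SWALR

model = nn.Linear(16, 16)                         # stand-in for the real network
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
swa_model = AveragedModel(model)                  # maintains the running weight average
total_steps = 1_000
swa_start = int(0.8 * total_steps)                # average only over the final ~20% of steps
swa_scheduler = SWALR(optimizer, swa_lr=1e-4)

for step in range(total_steps):
    x = torch.randn(32, 16)
    loss = (model(x) - x).pow(2).mean()           # dummy objective
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    if step >= swa_start:
        swa_model.update_parameters(model)        # fold current weights into the average
        swa_scheduler.step()
# Deploy swa_model.module (the smoothed weights) rather than the last iterate.
```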
```python
# Encapsulated tick-based logic with state swapping
def forward(self, idx, targets=None, states=None):
    if states is None:
        states = [None] * self.config.n_layer
    seq_len = idx.size(1)
    chunk_size = self.config.chunk_size  # e.g., 128
    logits, loss = None, None
    # Process the sequence in discrete temporal clock-cycles
    for i in range(0, seq_len, chunk_size):
        x_chunk = idx[:, i:i + chunk_size]
        y_chunk = targets[:, i:i + chunk_size] if targets is not None else None
        logits, loss, states = self.step(x_chunk, y_chunk, states)
        # Detach states to bound the gradient graph while maintaining state flow.
        # This keeps the VRAM footprint strictly constant (O(1) w.r.t. sequence length).
        states = [
            tuple(v.detach() for v in s) if isinstance(s, tuple)
            else (s.detach() if s is not None else None)
            for s in states
        ]
    return logits, loss, states
```

| Feature | Specification |
|---|---|
| Parameters | ~36M (Expanded via INT6/INT4 Sub-byte Quantization) |
| Architecture | 12 Layers (Hybrid 1:4 Attention-to-SSM ratio) |
| Embedding Dim | 768 |
| Core Layers | Mamba-3 (MIMO / Complex State) + Sliding Window Attention |
| Optimization | AdamW + SWA (Stochastic Weight Averaging) |
| Precision | BFloat16 Training -> INT6 Quantized Deployment (15.9 MB) |
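To illustrate the sub-byte packing referenced above, the sketch below quantizes a tensor to signed 6-bit values and packs four of them into every three bytes; `quantize_int6` and `pack_int6` are hypothetical helpers, not the project's actual export path:

```python
import numpy as np

def quantize_int6(weights: np.ndarray):
    """Symmetric per-tensor quantization of float weights to signed 6-bit integers."""
    scale = max(np.abs(weights).max() / 31.0, 1e-12)          # 6-bit signed range: [-32, 31]
    q = np.clip(np.round(weights / scale), -32, 31).astype(np.int8)
    return q, scale

def pack_int6(q: np.ndarray) -> bytes:
    """Pack four 6-bit values into three bytes (24 bits)."""
    u = (q.astype(np.int16) + 32).astype(np.uint8).ravel()    # shift to unsigned [0, 63]
    u = np.pad(u, (0, (-len(u)) % 4))                         # pad to a multiple of 4
    u = u.reshape(-1, 4).astype(np.uint32)
    word = (u[:, 0] << 18) | (u[:, 1] << 12) | (u[:, 2] << 6) | u[:, 3]
    out = np.stack([(word >> 16) & 0xFF, (word >> 8) & 0xFF, word & 0xFF], axis=1)
    return out.astype(np.uint8).tobytes()

w = np.random.randn(1000).astype(np.float32)
q, s = quantize_int6(w)
print(len(pack_int6(q)))   # 750 bytes: 4 KB of float32 shrinks to 0.75 KB at 6 bits/weight
```

A mixed INT6/INT4 layout, as the table suggests, would apply the same idea with per-tensor bit-widths chosen so the full payload fits in 15.9 MB.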
Preliminary tests on local hardware (NVIDIA T4) demonstrate rapid convergence within constant memory bounds.
- Baseline: loss at random initialization (7.16)
- H-ENN (V5.1): reached a loss of 3.10 within 600 steps (equivalent to ~3.5 minutes on an H100).
- Official Submission: [Awaiting final H100 execution with INT6 payload]
Pavel Shalyhin — AI Solutions Architect / Advanced RAG & Agentic Systems / Custom LLM Engineering
Focused on the development of resilient cognitive architectures and neuro-symbolic memory systems. Founder of OneCeroOne (Local RAG engines) and MYCELIUM (Memory-as-a-Service for AI agents). Architect of the ALICE meta-cognitive platform utilizing the Planner-Executor pattern.
"My approach is Architecture First. I design the logic, data flows, and failure points before writing the first line of code."
Contact: contact@onecero.one
GitHub: shalyhinpavel
LinkedIn: pavel-shalyhin
Disclaimer: This project is part of ongoing research into sub-quadratic efficiency and cognitive state management in constrained environments.