Hybrid Empty Neural Networks (H-ENN)

Agentic OS Kernel for Extreme Parameter Budgets

License: MIT · Framework: PyTorch

Abstract

Empty Neural Networks (ENN) is an architectural framework demonstrating that under extreme parameter constraints (e.g., the 16MB limit of the OpenAI Parameter Golf challenge), language models should shift from storage-bound factual databases to state-bound dynamic routing systems.

This repository implements a Hybrid Sub-Quadratic Architecture (V5) prioritizing temporal state continuity, extreme VRAM efficiency, and cognitive state swapping for Agentic OS environments.


The Challenge: The 16MB Efficiency Limit

Traditional scaling laws rely on Transformer MLPs acting as key-value stores for factual knowledge. However, when artificially constrained to approximately 15.5M - 35M parameters to fit within a 16MB serialized artifact, this paradigm encounters a structural bottleneck.

Current optimization strategies focus on incremental compression. H-ENN proposes a fundamental shift: sub-quadratic models with linear compute and constant memory requirements. By combining a linear SSM backbone with sparse attention mechanisms, we redefine the inductive bias of the forward pass to support continuous 24/7 agentic workflows without KV-cache explosion.


The Framework: Architecture First

H-ENN redefines the role of neural parameters. Instead of encoding a static dataset, the parameters ($\theta$) encode the rules of a dynamic memory system. Factual knowledge is not stored merely in the weights but exists transiently within the highly compressed trajectory of the hidden state.

1. Hybrid Routing (SSM + Sliding Window Attention)

To mitigate the retrieval weaknesses of fixed-size states in pure linear models, H-ENN employs a hybrid architecture that interleaves linear SSM layers with self-attention.

  • The Subconscious (Mamba-3 / GDN Backbone): Uses a multi-input, multi-output (MIMO) formulation for better model quality and hardware utilization without increasing decode latency, plus a complex-valued state update rule that enables richer state tracking.
  • The Precision Cache (Attention): Every 4th layer is a Sliding Window Attention block, providing an exact-match L1-cache for immediate context retrieval.
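As a concrete illustration, the 1:4 interleaving above can be expressed as a simple layer schedule. The helper name and the "ssm"/"attn" labels below are hypothetical, not taken from the repository code:

```python
def build_layer_schedule(n_layer: int = 12, attn_period: int = 4) -> list:
    """Place a Sliding Window Attention block at every `attn_period`-th layer;
    all remaining layers are linear SSM (Mamba-style) blocks."""
    return [
        "attn" if (i + 1) % attn_period == 0 else "ssm"
        for i in range(n_layer)
    ]

schedule = build_layer_schedule()
print(schedule)
# 12 layers -> attention at 1-indexed positions 4, 8, 12; SSM everywhere else
```

With 12 layers this yields 3 attention blocks and 9 SSM blocks, matching the 1:4 attention-to-SSM ratio in the specifications below.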

2. Cognitive State Swapping & Extreme Batching

Standard sequence-level training fails to stabilize deep state-space models and consumes excessive VRAM. H-ENN utilizes Tick-Based Truncated BPTT. The sequence is processed in discrete temporal "ticks", detaching the gradient graph at each step while maintaining the continuous flow of the compressed state.

This allows for massive batch sizes (e.g., 32k+ tokens per step) with near-zero VRAM scaling, and enables agents to "sleep" by serializing their compact KV/Mamba states to an SSD.
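The "sleep" step amounts to serializing the compact per-layer recurrent states and restoring them later. A minimal stand-in sketch, using pickle on plain tuples in place of torch.save on real Mamba/KV tensors (the function names, file name, and state shapes are illustrative):

```python
import pickle
from pathlib import Path

def sleep(states, path: str) -> None:
    """Serialize the compact per-layer states to disk (agent goes dormant).
    In the real system these would be Mamba/KV tensors saved via torch.save."""
    Path(path).write_bytes(pickle.dumps(states))

def wake(path: str):
    """Restore the per-layer states and resume the temporal trajectory."""
    return pickle.loads(Path(path).read_bytes())

# Toy example: one (hidden, conv) state tuple per layer
states = [((0.1, 0.2), (0.3,)) for _ in range(12)]
sleep(states, "agent_state.bin")
assert wake("agent_state.bin") == states
```

Because the serialized state is a few compressed tensors rather than a full KV-cache, the artifact stays small enough to swap agents in and out of SSD storage cheaply.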

3. SWA & Algorithmic Presets

To maximize the 10-minute training budget, the framework utilizes Stochastic Weight Averaging (SWA) for late-stage minimum smoothing, combined with Data-Dependent Initialization (Algorithmic Presets) to bypass the unigram/bigram learning phases.
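SWA maintains a running equal-weight average of parameter snapshots over the final training steps. A minimal, framework-free sketch of that averaging rule (in practice one would use torch.optim.swa_utils.AveragedModel; the class name and snapshot values below are illustrative):

```python
class SWAAverager:
    """Running equal-weight average of parameter snapshots:
    avg_{n+1} = avg_n + (w - avg_n) / (n + 1)."""
    def __init__(self):
        self.n = 0
        self.avg = None

    def update(self, weights):
        if self.avg is None:
            self.avg = list(weights)
            self.n = 1
            return
        self.n += 1
        for i, w in enumerate(weights):
            self.avg[i] += (w - self.avg[i]) / self.n

averager = SWAAverager()
# Only late-stage snapshots are averaged, smoothing the final minimum
for snapshot in ([1.0, 2.0], [3.0, 4.0], [5.0, 6.0]):
    averager.update(snapshot)
print(averager.avg)  # [3.0, 4.0]
```

The incremental form avoids storing every snapshot, so the averaging adds only one extra copy of the weights — relevant when the entire training run fits in a 10-minute budget.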

```python
# Encapsulated Tick-Based Logic with State Swapping:
def forward(self, idx, targets=None, states=None):
    if states is None:
        states = [None] * self.config.n_layer

    total_loss = 0.0
    n_chunks = 0
    seq_len = idx.size(1)
    chunk_size = self.config.chunk_size  # e.g., 128

    # Process the sequence in discrete temporal clock-cycles ("ticks")
    for i in range(0, seq_len, chunk_size):
        x_chunk = idx[:, i:i + chunk_size]
        y_chunk = targets[:, i:i + chunk_size] if targets is not None else None

        logits, loss, states = self.step(x_chunk, y_chunk, states)
        if loss is not None:
            # Accumulate a detached copy for reporting; the backward pass for
            # each tick happens inside self.step, so the autograd graph never
            # spans more than one chunk.
            total_loss += loss.detach()
            n_chunks += 1

        # Detach states to bound the gradient graph while maintaining state flow.
        # This keeps the VRAM footprint strictly constant (O(1) w.r.t. sequence length).
        states = [
            tuple(v.detach() for v in s) if isinstance(s, tuple)
            else (s.detach() if s is not None else None)
            for s in states
        ]

    # Report the mean per-tick loss rather than only the last chunk's loss
    mean_loss = total_loss / n_chunks if n_chunks else None
    return logits, mean_loss, states
```

Technical Specifications

| Feature | Specification |
| --- | --- |
| Parameters | ~36M (expanded via INT6/INT4 sub-byte quantization) |
| Architecture | 12 layers (hybrid 1:4 attention-to-SSM ratio) |
| Embedding dim | 768 |
| Core layers | Mamba-3 (MIMO / complex state) + Sliding Window Attention |
| Optimization | AdamW + SWA (Stochastic Weight Averaging) |
| Precision | BFloat16 training → INT6 quantized deployment (15.9 MB) |
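Sub-byte quantization packs weight codes tighter than byte boundaries. A minimal illustration of INT6 packing, where four 6-bit codes fit in three bytes (the helper names are hypothetical, not the repository's quantizer):

```python
def pack_int6(values):
    """Pack unsigned 6-bit codes (0-63) into a byte stream: 4 codes -> 3 bytes."""
    bitbuf, nbits, out = 0, 0, bytearray()
    for v in values:
        assert 0 <= v < 64, "INT6 code out of range"
        bitbuf = (bitbuf << 6) | v
        nbits += 6
        while nbits >= 8:
            nbits -= 8
            out.append((bitbuf >> nbits) & 0xFF)
    if nbits:  # flush trailing bits, zero-padded
        out.append((bitbuf << (8 - nbits)) & 0xFF)
    return bytes(out)

def unpack_int6(data, count):
    """Inverse of pack_int6: recover `count` 6-bit codes."""
    bitbuf, nbits, out = 0, 0, []
    for byte in data:
        bitbuf = (bitbuf << 8) | byte
        nbits += 8
        while nbits >= 6 and len(out) < count:
            nbits -= 6
            out.append((bitbuf >> nbits) & 0x3F)
    return out

codes = [0, 1, 31, 63, 42, 7, 55, 12]
packed = pack_int6(codes)
assert len(packed) == 6            # 8 codes * 6 bits = 48 bits = 6 bytes
assert unpack_int6(packed, 8) == codes
```

For rough scale: at a uniform 6 bits per weight, ~36M parameters would occupy about 27 MB, so reaching the 15.9 MB artifact presumably relies on the mixed INT6/INT4 scheme noted in the table above.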

Benchmark: OpenAI Parameter Golf (FineWeb-10B)

Preliminary tests on localized hardware (NVIDIA T4) demonstrate hyper-efficient convergence due to constant memory bounds.

  • Baseline: random-initialization loss floor (7.16)
  • H-ENN (V5.1): Achieved Loss 3.10 within 600 steps (equivalent to ~3.5 minutes on H100 architecture).
  • Official Submission: [Awaiting final H100 execution with INT6 payload]

Author

Pavel Shalyhin — AI Solutions Architect / Advanced RAG & Agentic Systems / Custom LLM Engineering

Focused on the development of resilient cognitive architectures and neuro-symbolic memory systems. Founder of OneCeroOne (Local RAG engines) and MYCELIUM (Memory-as-a-Service for AI agents). Architect of the ALICE meta-cognitive platform utilizing the Planner-Executor pattern.

"My approach is Architecture First. I design the logic, data flows, and failure points before writing the first line of code."

Contact: contact@onecero.one
GitHub: shalyhinpavel
LinkedIn: pavel-shalyhin


Disclaimer: This project is part of ongoing research into sub-quadratic efficiency and cognitive state management in constrained environments.
