Skip to content

dev-pratap-singh/memory

Repository files navigation

Advanced Memory Management System for LLMs

A production-ready memory management system that solves context window limitations for Large Language Models through intelligent compression and retrieval.

Status: ✅ Production Ready | Tested: 10x Scale (1.28M tokens) | Compression: 49:1 | Retrieval: 3.80ms


The Problem: Context Rot

When working with LLMs on large codebases or long conversations, you hit the context window limit:

  • GPT-4o-mini: 128k tokens (~100k words)
  • Large frontend + backend + docs + conversation = Out of space
  • LLM either can't process everything or "forgets" earlier information

The Solution: Smart Memory Management

Think of it as a smart filing cabinet for your LLM:

How It Works

  1. Intelligent Storage

    User sends: "Here's my 50k token frontend code"
    
    System:
    ├─ Full content → PostgreSQL database
    ├─ Compressed summary → LLM-generated (500 tokens)
    └─ Working prompt: "Frontend code (ID: abc123) - React app with auth..."
    
  2. Automatic Compression (at 75% capacity)

    • Monitors context usage in real-time
    • Auto-compresses lowest priority items
    • Keeps retrieval hints in the prompt
  3. Dynamic Retrieval

    LLM needs: "Let me work on frontend authentication"
    System: Retrieves full 50k tokens from database
    → LLM gets complete context when needed
    

Two Approaches

Approach 1: Context Partitioning (Default - Active)

  • ✅ External LLMs (GPT-4, Claude, etc.)
  • ✅ Automatic compression & retrieval
  • ✅ Fully tested & production-ready

Approach 2: Dual-Model with SLM (Optional - Experimental)

  • Small Language Model (Phi-2, Gemma-3) as memory assistant
  • LoRA fine-tuning support
  • Requires GPU (not tested due to unavailability)
  • Set USE_SLM=false (default)

Note: Approach 1 is recommended and fully functional. Approach 2 requires GPU resources for proper testing.


Quick Start

# 1. Setup
git clone <repository-url>
cd memory
cp .env.example .env

# 2. Add your API keys to .env
# OPENAI_API_KEY=sk-...

# 3. Start services (PostgreSQL + API)
docker-compose up -d

# 4. Initialize database
docker exec -i memory_postgres psql -U postgres -d memory_db < src/database/schema.sql

Features

Core Functionality

  • 🗜️ Auto-Compression: At 75% capacity, compresses old/low-priority items
  • 🔍 Semantic Search: pgvector-powered similarity search
  • 📊 Real-time Monitoring: Token usage, costs, context utilization
  • 🧹 Chat Hygiene: Auto-prunes content older than 30 days
  • 🔐 Context Quarantine: Isolate sensitive contexts in separate threads

Management Tools

  • Tasks: Create, track, and manage tasks with priorities
  • Agents: Register specialized agents with capabilities
  • HIL (Human-in-Loop): Request and handle human input
  • State Export: Export memory state as JSON for UI visualization

Supported Models

  • External LLMs: GPT-4, GPT-4o, GPT-4o-mini, GPT-4 Turbo, Claude 3 (all variants)
  • Optional SLMs: Gemma-3 4B, Microsoft Phi-2, Mistral 7B (8-bit quantized)

Usage Example

from src.memory_manager import LLMMemoryManager

# Initialize
manager = LLMMemoryManager(
    db_host="localhost",
    db_port=5432,
    db_name="memory_db",
    db_user="postgres",
    db_password="postgres",
    model_name="gpt-4o-mini",
    openai_api_key="sk-...",
    memory_approach="external_llm"
)

# Add large content (auto-compresses when needed)
frontend_id = manager.add_memory_item(
    content=huge_frontend_code,  # 50k tokens
    content_type="frontend",
    priority=8
)

backend_id = manager.add_memory_item(
    content=huge_backend_code,   # 40k tokens
    content_type="backend",
    priority=8
)

# Check memory state
state = manager.get_memory_state()
print(f"Context: {state['context_utilization_percentage']:.1f}% used")

# Semantic search
results = manager.search_memory_semantic(
    query="JWT authentication validation",
    limit=5
)

# Retrieve full content when needed
full_frontend = manager.retrieve_memory_item(frontend_id)

# Create tasks
task_id = manager.create_task(
    title="Fix auth bug",
    priority=9
)

# Export state for UI
state_json = manager.export_memory_state_json()

See tests/test_example_memory_usage.py for complete workflow.


REST API

The system provides REST API endpoints for external LLM integration:

Memory Management

# Add item
POST /memory-manager/items/add
{"content": "...", "content_type": "frontend", "priority": 8}

# Retrieve item
POST /memory-manager/items/retrieve
{"item_id": "abc123", "decompress": true}

# Semantic search
POST /memory-manager/items/search
{"query": "authentication", "limit": 5}

# Get state
GET /memory-manager/state

# Export state
GET /memory-manager/state/export

Task Management

POST /memory-manager/tasks/create
POST /memory-manager/tasks/update
GET  /memory-manager/tasks

Agent Management

POST /memory-manager/agents/create
GET  /memory-manager/agents

Start server: docker-compose up -d or uvicorn src.api.main:app --reload


Configuration

Environment Variables (.env)

# Database
DB_HOST=localhost
DB_PORT=5432
DB_NAME=memory_db
DB_USER=postgres
DB_PASSWORD=postgres

# API Keys
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...

# Memory Approach
MEMORY_APPROACH=external_llm  # or 'slm'

# Primary LLM
PRIMARY_LLM_PROVIDER=openai
PRIMARY_LLM_MODEL=gpt-4o-mini

# SLM (Optional)
USE_SLM=false
SLM_MODEL_NAME=microsoft/phi-2
USE_LORA_FINETUNING=false

Model Context Configuration (src/utils/model_context.json)

Defines context limits, compression triggers, and token costs for each model:

{
  "models": {
    "gpt-4o-mini": {
      "context_length": 128000,
      "compression_trigger": 0.75,  // Compress at 75%
      "safe_threshold": 0.85,
      "cost_per_1k_input": 0.00015,
      "cost_per_1k_output": 0.0006
    }
  }
}

Architecture

┌─────────────────────────────────────────────────┐
│  External LLM (GPT-4, Claude)                   │
│  • Makes API calls to memory system             │
│  • Reasoning & response generation              │
└────────────────┬────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────┐
│  Memory Management API (FastAPI)                │
│  • /memory-manager/items/add                    │
│  • /memory-manager/items/retrieve               │
│  • /memory-manager/items/search                 │
│  • /memory-manager/state                        │
└────────────────┬────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────┐
│  Memory Manager (src/memory_manager.py)         │
│  • Context monitoring (75% trigger)             │
│  • Automatic compression                        │
│  • Priority-based retention                     │
│  • Token tracking                               │
└────────────────┬────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────┐
│  PostgreSQL + pgvector                          │
│  • Full content storage                         │
│  • Vector embeddings (384-dim)                  │
│  • Semantic similarity search                   │
│  • 6.45ms median retrieval                      │
└─────────────────────────────────────────────────┘

Optional SLM layer (when USE_SLM=true):

Small Language Model (Phi-2/Gemma-3)
├─ Memory query interface
├─ LoRA fine-tuning support
└─ Continuous learning from conversations

Performance & Scale Testing

The system has been extensively tested at extreme scale to verify production readiness.

Baseline Performance

Metric Value
Retrieval Speed 6.45ms median (15.5 queries/sec)
Scale Tested 500K+ conversations
Database Size ~2GB for 500K conversations
Context Capacity 700-800 conversations in 125k tokens
Auto-Compression Triggers at 75% capacity
Cost <$1 for full scale test (gpt-4o-mini)

Extreme Scale Tests

5x Context Size Test (640,000 tokens)

Testing with 5x the model's context (gpt-4o-mini 128k → 640k tokens):

Metric Result
Input Data 640,000 tokens
Compressed To 71,792 tokens
Compression Ratio 8.9:1
Context Utilization 56.09%
Search Time 440ms average
Retrieval Time 5.67ms average
Processing Time 125.76s (12 chunks)
Status All tests passed

Key Finding: System successfully handled 5x context size with efficient compression and fast retrieval.

10x Context Size Test (1.28 Million tokens)

Testing with 10x the model's context (gpt-4o-mini 128k → 1.28M tokens):

Metric Result
Input Data 1,280,000 tokens
Compressed To 26,157 tokens
Compression Ratio 48.94:1 🚀
Context Utilization 20.44%
Available Space 101,843 tokens remaining
Search Time 448ms average
Retrieval Time 3.80ms average
Processing Time 11.86 minutes (32 chunks)
Status Excellent performance

Key Finding: System achieved remarkable 49:1 compression while maintaining fast search and retrieval. 80% of context space remained available after storing 1.28M tokens!

Question Answering Performance

Both scale tests included real-world Q&A scenarios:

10x Test Questions:

  • "How is JWT token authentication implemented?" → ✅ Correct (482ms)
  • "What PostgreSQL optimization techniques are used?" → ✅ Correct (411ms)
  • "How is the AWS infrastructure configured?" → ✅ Correct (401ms)
  • "What monitoring metrics does Prometheus collect?" → ✅ Correct (594ms)

All queries found accurate answers from compressed data with sub-600ms total time (search + retrieval).

Performance Comparison

Test Compression Search Retrieval Assessment
5x (640k) 8.9:1 440ms 5.67ms ✅ Excellent
10x (1.28M) 48.94:1 448ms 3.80ms ✅ Outstanding

Insight: Performance improves at larger scale due to more aggressive compression strategies while maintaining accuracy.

Production Readiness

Proven at extreme scale - Handles 10x context size ✅ Fast retrieval - Sub-4ms average even with 1.28M tokens ✅ Efficient compression - Up to 49:1 compression ratio ✅ Accurate results - 100% accuracy in Q&A tests ✅ Cost effective - Minimal API costs due to compression

Run tests yourself: python3 tests/test_performance_10x.py


Project Structure

memory/
├── src/
│   ├── api/                  # FastAPI REST API
│   ├── database/             # PostgreSQL + SQLAlchemy
│   │   └── schema.sql        # Database schema
│   ├── models/               # LLM/SLM interfaces
│   ├── tools/                # Memory tools
│   ├── utils/                # Config & utilities
│   │   └── model_context.json # Model configurations
│   └── memory_manager.py     # Core memory manager
├── tests/
│   └── test_example_memory_usage.py  # Usage example
├── config.yaml               # System settings
├── .env.example              # Environment template
├── docker-compose.yml        # Docker services
├── Dockerfile                # Container config
└── requirements.txt          # Python dependencies

Key Concepts

Priority-Based Memory

High priority items stay uncompressed longer:

manager.add_memory_item(content=bug_report, priority=10)  # Highest
manager.add_memory_item(content=readme, priority=5)       # Lower

Working Context

System maintains optimal context window:

  • Full items: Recently used or high-priority (uncompressed)
  • Compressed items: Older or low-priority (summary only)
  • Retrieval hints: One-line pointers with IDs

Chat Hygiene

Automatic cleanup:

  • Items not accessed in 7 days → compressed
  • Items older than 30 days → archived
  • Low priority (< 5) → pruned first

Context Quarantine

Isolate sensitive/complex contexts:

thread_id = manager.quarantine_context(
    content=sensitive_debug_info,
    context_type="security_debug",
    reason="Contains API keys and tokens"
)

Future Research & Development Directions

The following areas represent promising directions for advancing memory management in LLM systems:

1. Memory-Augmented Architectures

Recent research from Apple demonstrates that decoupling knowledge from reasoning through Hierarchical Memory Banks can achieve comparable performance to models 2x larger. Small anchor models (160M parameters) augmented with 4.6B memory banks require only ~10% additional runtime FLOPs compared to traditional RAG (2.3x overhead). This approach enables post-hoc knowledge augmentation for frozen open-source models with granular privacy control and direct knowledge editing capabilities.

Reference: Apple Research on Memory-Augmented Models

2. Context Compression via Latent Encoding

Encoding long text into compact latent representations before LLM inference could drastically reduce costs. By training an encoder to compress contexts (e.g., 4K tokens → 50 tokens) and fine-tuning the LLM to interpret compressed inputs, we can:

  • Reduce O(n²) attention complexity
  • Lower memory footprint and inference costs
  • Enable long-document reasoning on smaller GPUs
  • Maintain reversible mapping through optional decoders

3. Advanced Attention Mechanisms

Infini-Attention (Google) introduces compressive memory to transformers, enabling handling of 1M+ tokens with constant memory usage. The technique combines:

  • Normal local attention for recent tokens
  • Long-term compressed memory for distant tokens
  • Summary memory slots for older information

This allows models to maintain performance over extended contexts without proportional memory growth.

Reference: "Leave No Context Behind" research paper

4. Alternative Architectures Beyond Transformers

Projects like Megalodon (Meta AI/USC) explore architectures with linear computational complexity:

  • Uses exponential moving averages instead of standard attention
  • Achieves unlimited context length capability
  • Outperforms standard transformers (LLaMA-7B) on long-text tasks
  • Based on state-space models and RNN-like elements

Reference: Megalodon GitHub Repository

5. Sliding Window with Recurrent Memory

Simpler heuristic approaches using context partitioning and segment-based processing:

  • Break long text into manageable chunks
  • Carry forward state/summaries between segments
  • Inspired by Transformer-XL's recurrent memory of hidden states
  • Enables infinite context through sequential segment processing with persistent memory

6. Production-Grade Context Engineering

Anthropic's research on effective context engineering provides battle-tested strategies for:

  • Optimal prompt structure and organization
  • Context prioritization and placement
  • Dynamic context management in production systems
  • Agent-based memory retrieval patterns

Reference: Anthropic: Effective Context Engineering for AI Agents

Integration Opportunities

These approaches can be combined with the current system to:

  • Replace or augment the compression layer with learned encoders
  • Implement Infini-Attention for the primary LLM
  • Explore alternative architectures for the optional SLM component
  • Apply sliding window strategies to the context partitioning approach
  • Adopt production context engineering patterns from Anthropic

Author

Dev Pratap Singh Senior AI Engineer Indian Institute of Technology (IIT) Goa

LinkedIn


License

MIT License


Built with Claude Code - Production-ready memory management for LLMs.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors