Cryptographically-verifiable AI code generation for production Python.
Vibesafe is a developer tool that generates Python implementations from type-annotated specs, then locks them to checkpoints using content-addressed hashing. Engineers write small, doctest-rich function stubs; Vibesafe fills the implementation via LLM, verifies it against tests and type gates, and stores it under a deterministic SHA-256. In dev mode you iterate freely; in prod mode hash mismatches block execution, preventing drift between intent and deployed code.
How do you safely deploy AI-generated code when the model can produce different outputs on identical inputs?
Vibesafe solves this with hash-locked checkpoints: every spec (signature + doctests + model config) computes a deterministic hash, and generated code is verified then frozen under that hash. Runtime loading checks the hash before execution—if the spec changes or the checkpoint is missing, prod mode fails fast. This gives you reproducibility without sacrificing iteration speed in development.
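To make the idea concrete, here is a minimal sketch of how a deterministic spec hash could be computed from the three fields named above. It is not Vibesafe's actual hashing code: the function name `spec_hash` and the JSON payload layout are assumptions made for illustration.

```python
import hashlib
import json


def spec_hash(signature: str, doctests: list[str], model_config: dict) -> str:
    """Hypothetical helper: hash the fields that define a spec's identity."""
    payload = {"signature": signature, "doctests": doctests, "model": model_config}
    # Sorted keys and fixed separators give one canonical byte string per spec,
    # so dict ordering cannot change the hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


doctests = ['>>> greet("Alice")', "'Hello, Alice!'"]
base = spec_hash("greet(name: str) -> str", doctests, {"model": "gpt-4o-mini", "seed": 42})
reordered = spec_hash("greet(name: str) -> str", doctests, {"seed": 42, "model": "gpt-4o-mini"})
new_model = spec_hash("greet(name: str) -> str", doctests, {"model": "gpt-4o", "seed": 42})

print(base == reordered)  # True: same spec, same hash
print(base == new_model)  # False: a model change produces a new hash
```

Canonical serialization is the important design choice here: without it, two logically identical specs could hash differently and trigger false drift.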
Measured impact: Zero runtime hash mismatches in production across 150+ checkpointed functions over 6 months of internal use; dev iteration loop averages <10s for compilation + test verification; drift detection caught 23 unintended spec changes in CI before merge.
Vibesafe bridges human intent and AI-generated code through a contract system:
- Specs are code: Write a Python function with types and doctests, and mark where AI should fill in the implementation with raise VibeCoded()
- Generation is deterministic: Given the same spec + model settings, Vibesafe produces the same hash and checkpoint
- Verification is automatic: Generated code must pass doctests, type checking (mypy), and linting (ruff)
- Runtime is hash-verified: In prod mode, mismatched hashes block execution; in dev mode, they trigger regeneration
Traditional code generation tools either:
- Generate code once and leave you to maintain it manually (drift risk, no iteration)
- Generate code on every request (non-deterministic, slow, requires API keys in prod)
Vibesafe gives you both: fast iteration in dev, frozen safety in prod. The checkpoint system ensures what you tested is what runs, while the spec-as-code approach keeps your intent readable and version-controlled.
- Content-addressed checkpoints: Every checkpoint is stored under SHA-256(spec + prompt + generated_code), making builds reproducible and preventing silent drift
- Hybrid mode switching: Dev mode auto-regenerates on hash mismatch; prod mode fails hard, enforcing checkpoint integrity
- Dependency freezing: --freeze-http-deps captures exact runtime package versions into checkpoint metadata, solving the "works on my machine" problem for FastAPI endpoints
- Doctest-first verification: Tests are mandatory and embedded in the spec, not external files; the spec is the contract
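The content-addressed layout can be sketched in a few lines. This is an illustration under stated assumptions, not the library's code: `checkpoint_hash` and `checkpoint_dir` are hypothetical names, and the length-prefixing scheme is one possible way to combine the three inputs unambiguously.

```python
import hashlib
from pathlib import Path


def checkpoint_hash(spec_hash: str, prompt: str, code: str) -> str:
    """Hypothetical: SHA-256 over spec hash, prompt, and generated code."""
    digest = hashlib.sha256()
    for part in (spec_hash, prompt, code):
        data = part.encode("utf-8")
        # Length-prefix each part so ("ab", "c") and ("a", "bc") hash differently.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


def checkpoint_dir(root: Path, unit_id: str, chk_hash: str) -> Path:
    """Directory for one checkpoint: <root>/<module.path>/<function>/<hash>."""
    return root / unit_id / chk_hash


chk = checkpoint_hash("5a72e9", "rendered prompt", "def sum_str(a, b): ...")
path = checkpoint_dir(Path(".vibesafe/checkpoints"), "examples.math.ops/sum_str", chk)
print(path.as_posix())
```

Because the directory name is derived from the content, any change to the spec, prompt, or generated code lands in a new directory instead of overwriting an old checkpoint.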
| Tool | Approach | Vibesafe Difference |
|---|---|---|
| GitHub Copilot | Suggests code in editor | Vibesafe generates complete verified implementations |
| Cursor/Claude Code | AI pair programming | Vibesafe enforces hash-locked reproducibility |
| ChatGPT API | On-demand generation | Vibesafe caches + verifies once, reuses in prod |
| OpenAPI codegen | Schema-driven templates | Vibesafe uses LLMs for flexible logic, not just boilerplate |
Here's vibesafe in action—no configuration, just code:
>>> from vibesafe import vibesafe, VibeCoded
>>> @vibesafe
... def cowsay(msg: str) -> str:
...     """
...     >>> cowsay("moo")
...     'moo'
...     """
...     raise VibeCoded()
...
>>> print(cowsay('moo'))
moo
 \   ^__^
  \  (oo)\_______
     (__)\       )\/\
         ||----w |
         ||     ||
That's it. The decorator saw your function name, inferred the intent from "cowsay", and generated an ASCII art implementation. Now let's see how to use it in a real project.
- Python 3.12+ (3.13 supported, 3.11 not tested)
- uv (recommended) or pip
- OpenAI-compatible API key (OpenAI, Anthropic with proxy, local LLM server)
- Claude Code (optional, for enhanced development experience)
# Clone the repo (for now; PyPI package coming soon)
git clone https://github.com/julep-ai/vibesafe.git
cd vibesafe
# Create virtual environment and install
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
uv pip install -e ".[dev]"
# Verify installation
vibesafe --version
# or use the short alias:
vibe --version
Troubleshooting:
| Issue | Solution |
|---|---|
| command not found: vibesafe | Ensure .venv/bin is in $PATH or activate the venv |
| ModuleNotFoundError: vibesafe | Run uv pip install -e . from repo root |
| Python 3.12 required | Check python --version; install via python.org or package manager |
1. Configure your provider:
# Create vibesafe.toml in your project root
cat > vibesafe.toml <<EOF
[provider.default]
kind = "openai-compatible"
model = "gpt-4o-mini"
service_tier = "auto" # optional: auto|default|premium (provider-dependent)
api_key_env = "OPENAI_API_KEY"
EOF
# Set API key
export OPENAI_API_KEY="sk-..."
2. Write a spec:
# examples/quickstart.py
from vibesafe import vibesafe, VibeCoded
@vibesafe
def greet(name: str) -> str:
    """
    Return a greeting message.
    >>> greet("Alice")
    'Hello, Alice!'
    >>> greet("世界")
    'Hello, 世界!'
    """
    raise VibeCoded()
Optional: Claude Code Integration
If you use Claude Code, install the vibesafe plugin for enhanced development:
# In your Claude Code settings, add:
# plugin: /path/to/vibesafe/.claude-plugin
This gives you:
- /vibe commands directly in Claude Code
- Automatic vibesafe operations when reviewing code
- MCP server integration for seamless workflow
3. Generate + test:
# Compile the spec (calls LLM, writes checkpoint)
vibesafe compile --target examples.quickstart/greet
# Run verification (doctests + type check + lint)
vibesafe test --target examples.quickstart/greet
# Activate the checkpoint (marks it production-ready)
vibesafe save --target examples.quickstart/greet
4. Use it:
# Import the function directly (decorator handles checkpoint loading)
from examples.quickstart import greet
print(greet("World")) # "Hello, World!"
What just happened:
- compile parsed your spec, rendered a prompt, called the LLM, and saved the implementation to .vibesafe/checkpoints/examples.quickstart/greet/<hash>/impl.py
- test ran the doctests you wrote, plus mypy and ruff checks
- save wrote the checkpoint hash to .vibesafe/index.toml, activating it for runtime use
- The @vibesafe decorator loads from the active checkpoint transparently
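The last step, a decorator that transparently delegates to a stored implementation, can be pictured roughly as follows. This is a sketch only: `REGISTRY`, `vibesafe_sketch`, and the local `VibeCoded` class are hypothetical stand-ins, and an in-memory dict replaces the on-disk checkpoint store.

```python
import functools
from typing import Any, Callable


class VibeCoded(Exception):
    """Local stand-in for the placeholder exception raised by spec stubs."""


# Hypothetical registry mapping unit IDs to loaded implementations.
REGISTRY: dict[str, Callable[..., Any]] = {}


def vibesafe_sketch(func: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap a spec stub so calls are routed to its registered implementation."""
    unit_id = f"{func.__module__}/{func.__qualname__}"

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        impl = REGISTRY.get(unit_id)
        if impl is None:
            # No implementation registered: run the stub, which raises VibeCoded.
            return func(*args, **kwargs)
        return impl(*args, **kwargs)

    wrapper.unit_id = unit_id  # exposed so callers can register an implementation
    return wrapper


@vibesafe_sketch
def greet(name: str) -> str:
    """Return a greeting message."""
    raise VibeCoded()


REGISTRY[greet.unit_id] = lambda name: f"Hello, {name}!"
result = greet("World")
print(result)  # Hello, World!
```

Looking the implementation up at call time rather than at decoration time means the spec module can be imported before any checkpoint exists.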
Find all vibesafe units in your project:
vibesafe scan
# Output:
# Found 3 units:
# examples.math.ops/sum_str [2 doctests] ✓ checkpoint active
# examples.math.ops/fibonacci [4 doctests] ⚠ no checkpoint
# examples.api.routes/sum_endpoint [2 doctests] ✓ checkpoint active
Compile all units:
vibesafe compile
# Processes every @vibesafe-decorated function in the project
Compile specific module:
vibesafe compile --target examples.math.ops
# Only compiles functions in examples/math/ops.py
Compile single unit:
vibesafe compile --target examples.math.ops/sum_str
# Unit ID format: module.path/function_name
Force recompilation:
vibesafe compile --target examples.math.ops/sum_str --force
# Ignores existing checkpoint, generates fresh implementation
What happens during compilation:
- AST parser extracts signature, docstring, pre-hole code
- Spec hash computed from signature + doctests + pre-hole code + model config + template
- Prompt rendered via Jinja2 template (vibesafe/templates/function.j2, packaged in the library)
- LLM generates implementation (cached by spec hash)
- Generated code validated (correct signature, compiles, no obvious errors)
- Checkpoint written to .vibesafe/checkpoints/<unit>/<hash>/
Run doctest verification:
vibesafe test # Test all units
vibesafe test --target examples.math.ops # Test one module
vibesafe test --target examples.math.ops/sum_str # Test one unit
What gets tested:
- ✅ Doctests extracted from spec docstring
- ✅ Type checking via mypy
- ✅ Linting via ruff
- ⏭️ Hypothesis property tests (if hypothesis: fence in docstring)
- ✅ In prod, an aggregated pytest harness per source module is materialized from doctests to expand coverage
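The doctest gate can be reproduced with the standard doctest module: parse the examples out of the spec's docstring, then run them against a candidate implementation. The helper name `run_spec_doctests` is hypothetical, and this sketch covers only the doctest step, not the mypy or ruff gates.

```python
import doctest
from typing import Callable

SPEC_DOCSTRING = '''
Return a greeting message.

>>> greet("Alice")
'Hello, Alice!'
>>> greet("世界")
'Hello, 世界!'
'''


def run_spec_doctests(docstring: str, name: str, impl: Callable) -> doctest.TestResults:
    """Run the docstring's examples with `name` bound to `impl`."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(docstring, {name: impl}, name, None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    # Discard the failure report text; only the counts are used here.
    return runner.run(test, out=lambda text: None)


good = run_spec_doctests(SPEC_DOCSTRING, "greet", lambda name: f"Hello, {name}!")
bad = run_spec_doctests(SPEC_DOCSTRING, "greet", lambda name: name)

print(good.failed, good.attempted)  # 0 2
print(bad.failed, bad.attempted)    # 2 2
```

Binding the implementation through the globals dict keeps the spec's docstring as the single source of test cases, whichever implementation is under test.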
Test output example:
Testing examples.math.ops/sum_str...
✓ Doctest 1/3 passed
✓ Doctest 2/3 passed
✓ Doctest 3/3 passed
✓ Type check passed (mypy)
✓ Lint passed (ruff)
[PASS] examples.math.ops/sum_str
Detect spec changes that invalidate checkpoints:
vibesafe diff # Check all units
vibesafe diff --target examples.math.ops/sum_str # Check one unit
Output:
[DRIFT] examples.math.ops/sum_str
Spec hash: 5a72e9... (current)
Checkpoint hash: 2d46f1... (active)
Spec changed:
- Added doctest example
- Modified parameter annotation: str -> int
Location: .vibesafe/checkpoints/examples.math.ops/sum_str/2d46f1.../
Action: Run `vibesafe compile --target examples.math.ops/sum_str`
Common drift causes:
- Changed function signature
- Added/removed/modified doctests
- Changed pre-hole code
- Updated model config (e.g., gpt-4o-mini → gpt-4o)
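At its core, drift detection compares the spec hash computed now with the spec hash recorded when the checkpoint was written. A minimal sketch, assuming both sides are already available as dictionaries keyed by unit ID; `DriftReport` and `find_drift` are hypothetical names, and the short hash strings are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DriftReport:
    unit_id: str
    current: str             # spec hash computed from the source right now
    recorded: Optional[str]  # spec hash stored with the checkpoint; None if absent


def find_drift(current: dict[str, str], recorded: dict[str, str]) -> list[DriftReport]:
    """Return one report per unit whose spec hash differs from its checkpoint."""
    reports = []
    for unit_id, spec_hash in sorted(current.items()):
        stored = recorded.get(unit_id)
        if stored != spec_hash:
            reports.append(DriftReport(unit_id, spec_hash, stored))
    return reports


current = {
    "examples.math.ops/sum_str": "5a72e9",
    "examples.math.ops/fibonacci": "91c0aa",
    "examples.api.routes/sum_endpoint": "77be10",
}
recorded = {
    "examples.math.ops/sum_str": "2d46f1",         # stale: spec was edited
    "examples.api.routes/sum_endpoint": "77be10",  # up to date
}

reports = find_drift(current, recorded)
for report in reports:
    status = "missing checkpoint" if report.recorded is None else "drift"
    print(f"[{status}] {report.unit_id}")
```

A unit with no recorded hash is reported separately from a stale one, since the remedy differs: compile for the first time versus recompile.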
Activate a checkpoint (marks it production-ready):
vibesafe save --target examples.math.ops/sum_str
# Updates .vibesafe/index.toml with the checkpoint hash
Save all units (only if all tests pass):
vibesafe save
# Fails if any unit has failing tests
Freeze HTTP dependencies:
vibesafe save --target examples.api.routes/sum_endpoint --freeze-http-deps
# Writes requirements.vibesafe.txt with pinned versions
# Records fastapi, starlette, pydantic versions in checkpoint meta.toml
Why freeze dependencies? FastAPI endpoints have runtime dependencies that can break with version upgrades. Freezing captures the exact versions that passed your tests, making deployments reproducible.
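Capturing installed versions needs nothing beyond importlib.metadata. The sketch below shows one way it could be done; `freeze_versions` and `render_deps_table` are hypothetical helpers, and the rendered [deps] table is an assumed shape for the metadata.

```python
from importlib import metadata
from typing import Optional


def freeze_versions(packages: list[str]) -> dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if absent."""
    frozen: dict[str, Optional[str]] = {}
    for name in packages:
        try:
            frozen[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            frozen[name] = None
    return frozen


def render_deps_table(frozen: dict[str, Optional[str]]) -> str:
    """Render installed packages as a TOML [deps] table, skipping missing ones."""
    lines = ["[deps]"]
    for name, version in sorted(frozen.items()):
        if version is not None:
            lines.append(f'{name} = "{version}"')
    return "\n".join(lines)


# Versions depend on the local environment, so no fixed output is shown here.
print(render_deps_table(freeze_versions(["fastapi", "starlette", "pydantic"])))
```

Reading versions from installed distribution metadata, rather than from a requirements file, records what was actually present when the tests passed.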
Get project-wide summary:
vibesafe status
# Output:
# Vibesafe Project Status
# =======================
#
# Units: 5 total
# ✓ 4 with active checkpoints
# ⚠ 1 missing checkpoints
# ⚠ 0 with drift
#
# Doctests: 23 total
# Coverage: 80% (4/5 units production-ready)
#
# Next steps:
# - Compile: examples.math.ops/is_prime
| Command | Description | Key Options |
|---|---|---|
| vibesafe scan | List all specs and their status | --write-shims (deprecated) |
| vibesafe compile | Generate implementations | --target, --force |
| vibesafe test | Run verification (doctests + gates) | --target |
| vibesafe save | Activate checkpoints | --target, --freeze-http-deps |
| vibesafe diff | Show drift between spec and checkpoint | --target |
| vibesafe status | Project overview | |
| vibesafe check | Bundle lint + type + test + drift checks | --target |
| vibesafe repl | Interactive iteration loop (Phase 2) | --target |
Aliases: vibesafe and vibe are interchangeable.
[project]
python = ">=3.12" # Minimum Python version
env = "dev" # "dev" or "prod" (overridden by VIBESAFE_ENV)
[provider.default]
kind = "openai-compatible"
model = "gpt-4o-mini" # Model name
seed = 42 # Random seed for reproducibility
reasoning_effort = "medium" # optional: minimal|low|medium|high
service_tier = "auto" # optional: pass through to provider tiering
base_url = "https://api.openai.com/v1"
api_key_env = "OPENAI_API_KEY" # Environment variable name
timeout = 60 # Request timeout (seconds)
[paths]
checkpoints = ".vibesafe/checkpoints" # Where implementations are stored
cache = ".vibesafe/cache" # LLM response cache (gitignored)
index = ".vibesafe/index.toml" # Active checkpoint registry
generated = "__generated__" # Import shim directory (deprecated)
[prompts]
function = "vibesafe/templates/function.j2" # Template for @vibesafe
http = "vibesafe/templates/http_endpoint.j2" # Template for @vibesafe(kind="http")
[sandbox]
enabled = false # Run tests in isolated subprocess (Phase 1)
timeout = 10 # Test timeout (seconds)
memory_mb = 256 # Memory limit (not enforced yet)
@vibesafe
@vibesafe(
    provider: str = "default", # Provider name from vibesafe.toml
    template: str = "vibesafe/templates/function.j2", # Prompt template path
    model: str | None = None, # Override model per-unit
)
def your_function(...) -> ...:
    """
    Docstring should include doctests; missing examples emit a warning.
    >>> your_function(...)
    expected_output
    """
    # Optional pre-hole code (e.g., validation, parsing)
    raise VibeCoded()
@vibesafe(kind="http")
@vibesafe(
    kind="http",
    method: str = "GET", # HTTP method
    path: str = "/endpoint", # Route path
    tags: list[str] = [], # OpenAPI tags
    provider: str = "default",
    template: str = "vibesafe/templates/http_endpoint.j2",
    model: str | None = None,
)
async def your_endpoint(...) -> ...:
    """
    Endpoint description with doctests.
    >>> import anyio
    >>> anyio.run(your_endpoint, arg1, arg2)
    expected_output
    """
    raise VibeCoded()
| Exception | Cause | Remedy |
|---|---|---|
| VibesafeMissingDoctest | Spec lacks doctest examples | Add >>> examples to docstring |
| VibesafeValidationError | Generated code fails structural checks | Tighten spec (more examples, clearer docstring) |
| VibesafeProviderError | LLM API failure (timeout, auth, rate limit) | Check API key, network, quota |
| VibesafeHashMismatch | Spec changed but checkpoint is stale | Run vibesafe compile to regenerate |
| VibesafeCheckpointMissing | Prod mode but no active checkpoint | Run vibesafe compile + vibesafe save |
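To illustrate what "structural checks" might involve, here is a small validator that raises a locally defined VibesafeValidationError. The class is a stand-in for the real exception, and `validate_generated` is a hypothetical function covering only three checks: the code parses, the function exists, and the parameter names match.

```python
import ast


class VibesafeValidationError(Exception):
    """Local stand-in for the exception listed in the table above."""


def validate_generated(code: str, name: str, params: list[str]) -> None:
    """Raise VibesafeValidationError if the generated code is structurally wrong."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        raise VibesafeValidationError(f"code does not parse: {exc.msg}") from exc
    funcs = [
        node for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == name
    ]
    if not funcs:
        raise VibesafeValidationError(f"no top-level function named {name!r}")
    args = funcs[0].args
    found = [arg.arg for arg in args.posonlyargs + args.args]
    if found != params:
        raise VibesafeValidationError(f"expected parameters {params}, found {found}")


def passes(code: str) -> bool:
    try:
        validate_generated(code, "sum_str", ["a", "b"])
    except VibesafeValidationError:
        return False
    return True


print(passes("def sum_str(a, b):\n    return str(int(a) + int(b))\n"))  # True
print(passes("def sum_str(x, y):\n    return x + y\n"))                 # False
print(passes("def sum_str(a, b) return a\n"))                           # False
```

These checks are cheap and run before any test executes, so obviously malformed output can be rejected without importing or running it.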
┌─────────────────────────────────────────────────────────────────┐
│ Developer writes spec: │
│ @vibesafe │
│ def sum_str(a: str, b: str) -> str: │
│ """>>> sum_str("2", "3") → '5'""" │
│ raise VibeCoded() │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ AST Parser extracts: │
│ - Signature: sum_str(a: str, b: str) -> str │
│ - Doctests: [("2", "3") → "5"] │
│ - Pre-hole code: (none) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Hasher computes H_spec = SHA-256( │
│ signature + doctests + pre_hole + model + template │
│ ) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Prompt Renderer (Jinja2): │
│ - Loads vibesafe/templates/function.j2 │
│ - Injects signature, doctests, type hints │
│ - Produces final prompt string │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Provider calls LLM: │
│ - Checks cache: .vibesafe/cache/<H_spec>.json │
│ - If miss: POST to OpenAI API (temp=0, seed=42) │
│ - Returns generated Python code │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Validator checks: │
│ ✓ Code parses (AST valid) │
│ ✓ Function name matches │
│ ✓ Signature matches (params, return type) │
│ ✓ No obvious security issues │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Checkpoint Writer: │
│ - Computes H_chk = SHA-256(H_spec + prompt + code) │
│ - Writes .vibesafe/checkpoints/<unit>/<H_chk>/impl.py │
│ - Writes meta.toml (spec hash, timestamp, model, versions) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Test Harness runs: │
│ 1. Doctests (pytest wrappers) │
│ 2. Type check (mypy) │
│ 3. Lint (ruff) │
│ Result: PASS or FAIL │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ If tests pass, developer runs: │
│ vibesafe save --target <unit> │
│ │
│ Writes to .vibesafe/index.toml: │
│ [<unit>] │
│ active = "<H_chk>" │
│ created = "2025-10-30T12:34:56Z" │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Runtime: Direct function import │
│ from examples.math.ops import sum_str │
│ │
│ Decorator calls: load_checkpoint("examples.math.ops/sum_str") │
│ 1. Read .vibesafe/index.toml for active hash │
│ 2. Load .vibesafe/checkpoints/<unit>/<hash>/impl.py │
│ 3. In prod mode: verify H_spec matches checkpoint meta │
│ 4. Return the function object │
└─────────────────────────────────────────────────────────────────┘
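The cache lookup in the provider step of the pipeline above can be sketched as a thin wrapper around any completion callable. This is an illustration, not the library's code: `cached_complete`, the fake provider, and the JSON layout of the cache file are all assumptions.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def cached_complete(
    cache_dir: Path, spec_hash: str, prompt: str, complete: Callable[[str], str]
) -> str:
    """Return cached code for spec_hash, calling `complete` only on a miss."""
    cache_file = cache_dir / f"{spec_hash}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))["code"]
    code = complete(prompt)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps({"prompt": prompt, "code": code}), encoding="utf-8")
    return code


calls: list[str] = []


def fake_provider(prompt: str) -> str:
    """Stand-in for an LLM call; records each invocation."""
    calls.append(prompt)
    return "def sum_str(a, b):\n    return str(int(a) + int(b))\n"


with tempfile.TemporaryDirectory() as tmp:
    first = cached_complete(Path(tmp), "5a72e9", "rendered prompt", fake_provider)
    second = cached_complete(Path(tmp), "5a72e9", "rendered prompt", fake_provider)

print(first == second)  # True
print(len(calls))       # 1: the second call was served from the cache
```

Keying the cache by spec hash means an unchanged spec never triggers a second API call, which is what keeps repeated compiles fast and repeatable.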
Vibesafe uses a pluggable provider system. Phase 1 ships with openai-compatible, which works with:
- OpenAI (GPT-4o, GPT-4o-mini)
- Anthropic (via OpenAI-compatible proxy)
- Local LLMs (llama.cpp, vLLM, Ollama with OpenAI API)
Provider interface:
from typing import Protocol

class Provider(Protocol):
    def complete(
        self,
        prompt: str,
        system: str | None = None,
        seed: int = 42,
        temperature: float = 0.0,
        max_tokens: int | None = None,
        **kwargs
    ) -> str:
        """Return generated code as string."""
Adding providers:
Implement the Provider protocol and register in vibesafe.toml:
[provider.anthropic]
kind = "anthropic-native"
model = "claude-3-5-sonnet-20250131"
api_key_env = "ANTHROPIC_API_KEY"
Dev mode (env = "dev"):
- Import triggers load_active(unit_id)
- Read .vibesafe/index.toml for active checkpoint hash
- Compute current spec hash H_spec
- If H_spec ≠ checkpoint's spec hash:
  - Warn: "Spec drift detected, regenerating..."
  - Auto-run vibesafe compile --target <unit>
  - Load new checkpoint
- Return function object
Prod mode (env = "prod" or VIBESAFE_ENV=prod):
- Import triggers load_active(unit_id)
- Read .vibesafe/index.toml for active checkpoint hash
- If no checkpoint: raise VibesafeCheckpointMissing
- Load checkpoint metadata from meta.toml
- Compute current spec hash H_spec
- If H_spec ≠ checkpoint's spec hash: raise VibesafeHashMismatch
- Return function object
This enforces:
- ✅ What you tested is what runs (no silent regeneration)
- ✅ Drift is caught before deployment
- ✅ Reproducibility across environments
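The two loading modes differ only in how they react to a missing or stale checkpoint. The sketch below captures that decision under stated assumptions: the exception classes are local stand-ins, `resolve_checkpoint` is a hypothetical name, plain dicts replace index.toml and meta.toml, and compiling on demand when a checkpoint is missing in dev mode is assumed rather than documented.

```python
from typing import Callable


class VibesafeCheckpointMissing(Exception):
    """Local stand-in for the real exception."""


class VibesafeHashMismatch(Exception):
    """Local stand-in for the real exception."""


def resolve_checkpoint(
    unit_id: str,
    spec_hash: str,
    index: dict[str, str],  # unit ID -> active checkpoint hash
    meta: dict[str, str],   # checkpoint hash -> spec hash recorded at compile time
    env: str,
    recompile: Callable[[str], str],
) -> str:
    """Return the checkpoint hash to load, or raise in prod on any mismatch."""
    active = index.get(unit_id)
    if active is None:
        if env == "prod":
            raise VibesafeCheckpointMissing(unit_id)
        return recompile(unit_id)  # assumption: dev mode compiles on demand
    if meta[active] != spec_hash:
        if env == "prod":
            raise VibesafeHashMismatch(f"{unit_id}: spec {spec_hash} != {meta[active]}")
        return recompile(unit_id)  # dev mode regenerates on drift
    return active


unit = "examples.math.ops/sum_str"
index = {unit: "chk-old"}
meta = {"chk-old": "spec-v1"}


def recompile(unit_id: str) -> str:
    return "chk-new"


unchanged = resolve_checkpoint(unit, "spec-v1", index, meta, "prod", recompile)
dev_drift = resolve_checkpoint(unit, "spec-v2", index, meta, "dev", recompile)
try:
    resolve_checkpoint(unit, "spec-v2", index, meta, "prod", recompile)
    prod_drift = "loaded"
except VibesafeHashMismatch:
    prod_drift = "blocked"

print(unchanged, dev_drift, prod_drift)  # chk-old chk-new blocked
```

Note that the prod path never calls `recompile`, so a production process needs no API key and cannot silently change behavior.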
1. CI/CD gating:
# .github/workflows/ci.yml
jobs:
  vibesafe-check:
    runs-on: ubuntu-latest
    steps:
      - run: vibesafe diff
        # Fails if any unit has drifted
      - run: vibesafe test
        # Runs all doctests + type/lint gates
      - run: vibesafe save --dry-run
        # Verifies all checkpoints exist
In 6 months of use, this caught 23 unintended spec changes (typos in doctests, accidental signature edits) before merge.
2. Frozen HTTP dependencies:
# Before deploying FastAPI app
vibesafe save --target api.routes --freeze-http-deps
git add requirements.vibesafe.txt .vibesafe/checkpoints/
git commit -m "Lock FastAPI endpoint dependencies"
The meta.toml records:
[deps]
fastapi = "0.115.2"
starlette = "0.41.2"
pydantic = "2.9.1"
Now your containerized deployment uses the exact versions that passed tests, preventing "works on my laptop" bugs.
3. Prompt regression coverage:
Every time you change a spec, the hash changes. This creates a natural test suite for prompt engineering:
# After editing vibesafe/templates/function.j2
vibesafe compile --force # Regenerate all units
vibesafe test # Verify all doctests still pass
vibesafe diff # Review generated code changes
If a prompt change breaks existing specs, doctests fail immediately. This turned prompt iteration from "test manually and hope" to "change, verify, commit."
4. Local agents + vibesafe.toml contract:
The vibesafe.toml file is the single source of truth for:
- Which model to use
- What temperature/seed settings
- Where checkpoints live
- Which prompt templates apply
Local AI coding agents (Claude Code, Cursor, GitHub Copilot) can read vibesafe.toml and understand the contract without asking the developer. Example: a PR review agent sees model = "gpt-4o-mini" and knows not to suggest "use GPT-4" (it's explicitly not wanted here).
The examples/ directory doubles as regression fixtures:
$ tree examples/
examples/
├── math/
│ └── ops.py # sum_str, fibonacci, is_prime
└── api/
└── routes.py # sum_endpoint, hello_endpoint
$ vibesafe test --target examples.math.ops
✓ sum_str [3 doctests]
✓ fibonacci [4 doctests]
✓ is_prime [5 doctests]
[PASS] 3/3 units
These examples serve three purposes:
- Documentation: Show real usage patterns
- Testing: Verify vibesafe's own codegen pipeline
- Fixtures: Golden tests for prompt/model changes
| Feature | Status | Notes |
|---|---|---|
| Python 3.12+ support | ✅ | Tested on 3.12, 3.13 |
| @vibesafe decorator | ✅ | Function and endpoint generation |
| kind parameter | ✅ | Supports "function", "http", "cli" |
| Doctest verification | ✅ | Auto-extracted from docstrings |
| Type checking (mypy) | ✅ | Mandatory gate before save |
| Linting (ruff) | ✅ | Enforces style consistency |
| Hash-locked checkpoints | ✅ | SHA-256 content addressing |
| Drift detection | ✅ | vibesafe diff command |
| OpenAI-compatible providers | ✅ | Works with OpenAI, proxies, local LLMs |
| CLI (scan, compile, test, save, status, diff, check) | ✅ | vibesafe or vibe alias |
| Dependency freezing | ✅ | --freeze-http-deps flag |
| Jinja2 prompt templates | ✅ | Customizable via vibesafe.toml |
| LLM response caching | ✅ | Keyed by spec hash, speeds up iteration |
| Subprocess sandbox | ✅ | Optional isolation for test runs |
| Claude Code Plugin | ✅ | Full integration with Claude Code |
| MCP Server | ✅ | Model Context Protocol server |
| GitHub Actions | ✅ | Automated Claude Code reviews |
Current coverage: 150+ checkpointed functions across 3 internal projects, 95% test coverage for vibesafe core.
Phase 2 (In Progress) — See ROADMAP.md
- Interactive REPL (vibesafe repl --target <unit>)
  - Commands: gen, tighten, diff, save, rollback
  - Planned Q2 2025
- Property-based testing (Hypothesis integration)
  - Extract hypothesis: fences from docstrings
  - Auto-generate property tests
- Multi-provider support (Anthropic native, Gemini, local inference)
- Advanced dependency tracing (hybrid static + runtime)
- Web UI dashboard (checkpoint browser, diff viewer)
- Sandbox enhancements (network/FS isolation, resource limits)
- PyPI package release (pip install vibesafe)
- Documentation site (Docusaurus on GitHub Pages)
- VS Code extension (syntax highlighting for @vibesafe specs)
- Performance benchmarks (compilation time, test throughput)
- Migration guide (v0.1 → v0.2)
Contributions welcome! Please:
- Open an issue first for features/bugs
- Follow the spec in SPEC.md
- Add tests for new functionality
- Update TODOS.md if you complete a roadmap item
Development setup:
git clone https://github.com/julep-ai/vibesafe.git
cd vibesafe
uv venv && source .venv/bin/activate
uv pip install -e ".[dev]"
# Run tests
pytest -n auto
# Type check
mypy src/vibesafe
# Lint
ruff check src/ tests/ examples/
# Format
ruff format src/ tests/ examples/
Claude Code Integration: This repo includes a full Claude Code plugin with:
- MCP server for seamless vibesafe operations
- Slash commands (/vibe, /vibe-init, /vibe-mode, /vibe-status)
- Automated PR reviews and test failure analysis
- Skills for AI-assisted development workflows
See .claude-plugin/ for plugin configuration and .github/workflows/ for CI automation.
- ✅ Iteration speed: Dev mode auto-regenerates on import, no manual compile step
- ✅ Reproducibility: Same spec = same hash = same code
- ✅ Testability: Doctests are mandatory, enforced at save time
- ✅ Prod safety: Hash mismatches block execution, preventing drift
- ❌ Complex state machines: Specs are per-function, not multi-step workflows (use orchestration layer)
- ❌ Dynamic prompt injection: Templates are static Jinja2, not runtime-constructed (by design, for reproducibility)
- ❌ Multi-language support: Python-only (Rust/TypeScript on roadmap if demand exists)
- ❌ GUI for non-coders: CLI-first tool, requires Python knowledge
- Exploratory prototyping: If you're not sure what the API should be, write it manually first
- Performance-critical code: LLM-generated implementations may not be well optimized (profile before deploying)
- Regulatory/compliance code: Review generated code manually; vibesafe ensures reproducibility, not correctness
- Sub-second latency requirements: Checkpoint loading adds ~10ms overhead on first import
MIT — see LICENSE
Built with:
- uv — Fast Python package manager
- ruff — Fast Python linter
- mypy — Static type checker
- pytest — Testing framework
- Jinja2 — Prompt templating
Inspired by:
- Defunctionalization (Reynolds, 1972) — Making implicit control explicit
- Content-addressed storage (Git, Nix) — Deterministic builds via hashing
- Test-driven development — Specs as executable contracts
- Literate programming (Knuth) — Code that explains itself
- Issues: github.com/julep-ai/vibesafe/issues
- Discussions: github.com/julep-ai/vibesafe/discussions
- Email: [email protected]