scienceaix/awesome-harness-engineering

Awesome Harness Engineering 🛠️

A curated, comprehensive list of tools, frameworks, research papers, and enterprise playbooks for Harness Engineering—the discipline of designing deterministic environments, strict constraints, and automated feedback loops that make AI agents reliable and autonomous.

The model is the commodity; the harness is the moat.

Core Runtimes & Agentic OS

Tools that provide the scaffolding, memory management, and execution orchestration for autonomous agents.

  • IHR (In-Loop Harness Runtime) - An in-loop LLM runtime designed to directly interpret Natural-Language Agent Harnesses (NLAHs), cleanly separating execution from logic.
  • OpenDev - Advanced harness patterns optimized strictly for Levels 1-2 verifiable coding tasks.
  • OpenClaw - (Use with extreme caution) The viral, local-first AI agent runtime that highlighted the critical need for strict harness security and multi-user isolation.
  • NemoClaw - NVIDIA's enterprise-hardened fork of OpenClaw featuring kernel-level sandboxing, deterministic routers, and zero-trust execution.
  • Playwright MCP - The Model Context Protocol implementation used by Anthropic for deterministic, adversarial QA and UI evaluation within a multi-agent harness loop.

Evaluation & Benchmarking

Systems to empirically measure agent capability, alignment, architectural compliance, and cognitive progression.

  • EleutherAI LM Evaluation Harness (v0.4.9.1) - A widely used open-source framework for evaluating language models. Supports multimodal tasks, vLLM acceleration, and regex-based post-processing for Llama, Qwen, and Gemma evaluations.
  • PostTrainBench - Evaluates whether agents can autonomously post-train base LLMs against 7 strict benchmarks within fixed H100 compute constraints.
  • Stripe Agent Benchmark - End-to-end benchmark for agents executing real-world cross-domain glue work, API integration, and database state management.
  • DeepEval - Platform for AI quality observability, automated human feedback collection, and A/B testing of harness performance and token efficiency.
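
Regex post-processing, mentioned above for evaluation harnesses, generally means pulling a short final answer out of a free-form generation before scoring it. The following is a minimal sketch of that idea using only the Python standard library. It is not the API of any tool listed here, and the pattern, the `[invalid]` fallback, and the function names are assumptions made for illustration.

```python
import re

# Hypothetical answer pattern for a multiple-choice task with options A-D.
ANSWER_RE = re.compile(r"answer is\s*\(?([A-D])\)?", re.IGNORECASE)


def extract_answer(generation, fallback="[invalid]"):
    """Return the last option letter the model committed to, or a fallback."""
    matches = ANSWER_RE.findall(generation)
    return matches[-1].upper() if matches else fallback


def exact_match_accuracy(generations, references):
    """Score extracted answers against reference letters by exact match."""
    if not generations:
        return 0.0
    hits = sum(extract_answer(g) == r for g, r in zip(generations, references))
    return hits / len(generations)
```

Taking the last match rather than the first is a design choice: models often restate or revise an answer, and the final statement is usually the one they commit to.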

Enterprise Playbooks & Specifications

How organizations like Stripe, Shopify, and Block actually deploy agents to production without breaking their architecture.

  • agents.md - The universal standard for creating machine-readable repository documentation to govern agent behavior, style preferences, and CI/CD rules.
  • FairMind 72 Criteria Checklist - An open-source breakdown of the 7 dimensions and 72 practices required to achieve Level 5 harness maturity in a corporate environment.
  • Factory Model Playbook - Playbook for transitioning engineering teams from writing syntax to managing multi-agent orchestration (Feature Authors, Test Generators, Architecture Guardians).
  • Anthropic Harness Design Patterns - Implementation guides for building adversarial Planner/Generator/Evaluator architectures equipped with context-reset mechanisms.
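
A Planner/Generator/Evaluator loop with context resets can be sketched roughly as follows. This is an illustrative outline and not code from any of the playbooks above: the `Context` class, the `run_harness` function, and the three callables are hypothetical stand-ins for model calls.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Context:
    """Working memory for a single attempt; discarded at every reset."""
    messages: List[str] = field(default_factory=list)


def run_harness(task, planner, generator, evaluator, max_rounds=3):
    """Plan once, then alternate generation and evaluation.

    Returns (candidate, round_number) on success, or (None, max_rounds)
    if no candidate passes the evaluator.
    """
    plan = planner(task)
    feedback = ""
    for round_number in range(1, max_rounds + 1):
        # Context reset: each round starts from a fresh context that holds
        # only the plan and the evaluator's latest feedback.
        ctx = Context()
        ctx.messages.append(f"plan: {plan}")
        if feedback:
            ctx.messages.append(f"feedback: {feedback}")
        candidate = generator(ctx)
        passed, feedback = evaluator(task, candidate)
        if passed:
            return candidate, round_number
    return None, max_rounds
```

The evaluator is kept separate from the generator so that it can judge output it did not produce, and the reset keeps failed attempts from accumulating in the generator's context.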

Security & Deterministic Guardrails

Probabilistic guardrails will inevitably fail. These tools enforce hard mathematical and cryptographic boundaries.

  • Deterministic Interceptors - Middle-layer pre-tool and post-tool execution hooks to sanitize data (regex scanning, base64 decoding) before it enters the agent's context window.
  • Session Isolation Toolkit - A lab topology toolkit to aggressively test and expose ingress vs. execution identity bleed (CVE-2026-27183) in multi-agent routers.
  • Prompt Guard (Meta) - Robust, lightweight input/output classification models to prevent unauthorized tool execution and system prompt overriding.
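
The pre-tool and post-tool hooks described under Deterministic Interceptors might look something like the sketch below. It uses only the Python standard library. The deny patterns, hook names, and blocking policy are assumptions chosen for illustration and are not taken from any listed project.

```python
import base64
import binascii
import re

# Hypothetical deny patterns, chosen only for illustration.
DENY_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),  # API-key-like tokens
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
]
# Runs of 16+ base64 alphabet characters, optionally padded.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def matches_deny_pattern(text):
    return any(p.search(text) for p in DENY_PATTERNS)


def decode_base64_runs(text):
    """Decode every base64-looking run that yields valid UTF-8 text."""
    decoded = []
    for match in BASE64_RUN.finditer(text):
        try:
            raw = base64.b64decode(match.group(0), validate=True)
            decoded.append(raw.decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            continue  # not real base64, or not text; ignore this run
    return decoded


def pre_tool_hook(tool_args):
    """Reject a tool call whose string arguments match a deny pattern."""
    for name, value in tool_args.items():
        if isinstance(value, str) and matches_deny_pattern(value):
            raise ValueError(f"argument {name!r} matched a deny pattern")
    return tool_args


def post_tool_hook(tool_output):
    """Sanitize tool output before it reaches the agent's context window."""
    if any(matches_deny_pattern(d) for d in decode_base64_runs(tool_output)):
        return "[blocked: encoded payload matched a deny pattern]"
    sanitized = tool_output
    for pattern in DENY_PATTERNS:
        sanitized = pattern.sub("[redacted]", sanitized)
    return sanitized
```

Both hooks are plain functions with no model in the loop, so the same input always produces the same decision. That property is what makes an interceptor deterministic.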

Cutting-Edge Research Papers


Contributing

Contributions are welcome! Please submit a Pull Request to add new tools, specifications, or cutting-edge academic papers that advance the discipline of Harness Engineering.
