Analysis of pullfrog/pullfrog surfaced several patterns worth adopting in Warden. These are the most concrete, high-impact takeaways from a deep codebase comparison.
Multi-lens parallel review orchestration
Pullfrog dispatches an independent read-only subagent per review lens (correctness, security, user-journey, performance, etc.) in parallel, then aggregates and deduplicates the findings. Warden currently runs one skill at a time per hunk. Coordinating multi-skill reviews and aggregating findings across lenses could improve both recall and coherence.
- Parallel subagent fan-out per lens with independent context discovery
- Orchestrator-level aggregation: overlapping findings from multiple lenses are a strong signal
- Lens selection is adaptive based on PR triage (domain, seams, external contracts touched)
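The fan-out and aggregation steps above can be sketched roughly as follows. This is an illustration only: the `Finding` shape, the `LensReviewer` signature, and both function names are hypothetical and come from neither Pullfrog nor Warden. It assumes findings can be keyed by file and line, so a location reported by several lenses sorts first.

```typescript
// Hypothetical types for illustration; not Pullfrog's or Warden's actual API.
interface Finding {
  lens: string; // e.g. "correctness", "security"
  file: string;
  line: number;
  message: string;
}

interface AggregatedFinding {
  file: string;
  line: number;
  lenses: string[]; // every lens that reported this location
  messages: string[];
}

type LensReviewer = (diff: string) => Promise<Finding[]>;

// Group findings by location. Locations reported by more lenses sort first,
// treating overlap between lenses as a stronger signal.
function aggregateFindings(findings: Finding[]): AggregatedFinding[] {
  const byLocation = new Map<string, AggregatedFinding>();
  for (const f of findings) {
    const key = f.file + ":" + f.line;
    let entry = byLocation.get(key);
    if (!entry) {
      entry = { file: f.file, line: f.line, lenses: [], messages: [] };
      byLocation.set(key, entry);
    }
    if (entry.lenses.indexOf(f.lens) === -1) entry.lenses.push(f.lens);
    if (entry.messages.indexOf(f.message) === -1) entry.messages.push(f.message);
  }
  return Array.from(byLocation.values()).sort(
    (a, b) => b.lenses.length - a.lenses.length
  );
}

// Run every lens concurrently. A lens that throws contributes no findings
// instead of failing the whole review.
async function reviewWithLenses(
  diff: string,
  reviewers: LensReviewer[]
): Promise<AggregatedFinding[]> {
  const perLens = await Promise.all(
    reviewers.map((review) => review(diff).catch(() => [] as Finding[]))
  );
  const all = perLens.reduce((acc, list) => acc.concat(list), [] as Finding[]);
  return aggregateFindings(all);
}
```

Keying on file and line is the simplest possible overlap rule. A real implementation would likely need fuzzier matching, such as overlapping line ranges or similar messages.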
Cross-run PR context (rolling summary snapshots)
Pullfrog maintains a rolling PR summary file that persists across re-review runs. Each run reads the previous snapshot, uses it to inform triage and lens selection, then updates it with the PR's current state. This gives incremental reviews cumulative memory instead of starting cold.
- Seeds a tmpfile with the previous snapshot (fetched from API)
- Agent reads it at run start alongside the diff
- Agent updates it in place; persisted server-side at run end
- Prevents re-flagging resolved issues and surfaces new risks from new commits
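A minimal sketch of the read-then-update cycle, assuming findings carry stable string ids. The `PrSnapshot` shape and `updateSnapshot` are hypothetical names for illustration. Fetching the previous snapshot from the API and persisting the new one are left out.

```typescript
// Hypothetical snapshot shape; the real Pullfrog summary file is free-form text.
interface PrSnapshot {
  headSha: string;
  openFindings: string[]; // ids present in the most recent run
  resolvedFindings: string[]; // ids seen in an earlier run and now absent
}

interface RunResult {
  snapshot: PrSnapshot; // state to persist at run end
  newlyFlagged: string[]; // ids never seen before; the only ones to comment on
}

// Merge the current run's findings into the previous snapshot.
// `previous` is null on the first review of a PR (cold start).
function updateSnapshot(
  previous: PrSnapshot | null,
  headSha: string,
  currentFindings: string[]
): RunResult {
  const prevOpen = previous ? previous.openFindings : [];
  const prevResolved = previous ? previous.resolvedFindings : [];
  const known = new Set<string>(prevOpen.concat(prevResolved));
  const current = new Set<string>(currentFindings);

  const openFindings: string[] = [];
  const newlyFlagged: string[] = [];
  current.forEach((id) => {
    openFindings.push(id);
    if (!known.has(id)) newlyFlagged.push(id);
  });

  // Anything known from earlier runs that is absent now counts as resolved.
  const resolvedFindings: string[] = [];
  known.forEach((id) => {
    if (!current.has(id)) resolvedFindings.push(id);
  });

  return {
    snapshot: { headSha, openFindings, resolvedFindings },
    newlyFlagged,
  };
}
```

One design choice to note: this sketch never re-flags an id it has seen before, even if a resolved finding reappears. Whether regressions should be raised again is a policy decision this example does not settle.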
Repo-level learnings persistence
Pullfrog seeds a repo-level learnings file each run. The agent can record patterns it discovers (e.g. "this repo uses X convention for Y") and reference them in future runs. Warden's skill execution is stateless — each run starts with zero repo-specific context beyond what the skill prompt and diff provide.
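As a rough sketch of what this could look like for Warden, the example below keeps one note per topic and renders the notes as a preamble for the next run's prompt. `Learning`, `recordLearning`, and `renderLearnings` are hypothetical names, and the one-note-per-topic rule is an assumption rather than Pullfrog's actual format.

```typescript
// Hypothetical learning record; Pullfrog's learnings file format is not specified here.
interface Learning {
  topic: string; // e.g. "error handling"
  note: string; // e.g. "this repo wraps errors in AppError"
}

// Add a learning, replacing any existing note on the same topic
// (compared case-insensitively) so the newest note wins.
function recordLearning(existing: Learning[], next: Learning): Learning[] {
  const key = next.topic.trim().toLowerCase();
  const kept = existing.filter((l) => l.topic.trim().toLowerCase() !== key);
  return kept.concat([next]);
}

// Render learnings as a plain-text block that could be prepended to a skill
// prompt. Returns an empty string when there is nothing to add.
function renderLearnings(learnings: Learning[]): string {
  if (learnings.length === 0) return "";
  const lines = learnings.map((l) => "- " + l.topic + ": " + l.note);
  return "Repo conventions learned in earlier runs:\n" + lines.join("\n");
}
```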
Incremental range-diff for re-reviews
Pullfrog tracks beforeSha on pull_request_synchronize events and generates incremental range-diffs scoped to new commits. Warden's review-state tracking could be enriched with similar range-diff intelligence to avoid re-analyzing unchanged hunks.
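One way Warden could skip unchanged hunks is sketched below, under two assumptions: the before and after SHAs are available from the synchronize event, and review state can store a fingerprint per reviewed hunk. The `Hunk` shape and all function names are hypothetical. The hash is a simple non-cryptographic one chosen to keep the example dependency-free.

```typescript
// Hypothetical hunk shape for illustration.
interface Hunk {
  file: string;
  content: string; // the hunk's diff text
}

// Arguments for a plain `git diff` between the previous and current head.
// After a force push the old SHA may not be an ancestor of the new one;
// the diff still works only if both commits are still present locally.
function incrementalDiffArgs(beforeSha: string, afterSha: string): string[] {
  return ["diff", beforeSha + ".." + afterSha];
}

// Non-cryptographic fingerprint (a djb2-style hash) of a hunk's file and content.
function fingerprint(hunk: Hunk): string {
  const text = hunk.file + "\n" + hunk.content;
  let hash = 5381;
  for (let i = 0; i < text.length; i++) {
    hash = ((hash * 33) ^ text.charCodeAt(i)) >>> 0;
  }
  return hash.toString(16);
}

// Keep only hunks whose fingerprint was not recorded by an earlier run.
function selectHunksToReview(current: Hunk[], reviewed: Set<string>): Hunk[] {
  return current.filter((h) => !reviewed.has(fingerprint(h)));
}
```

A hunk whose content shifts position but is otherwise identical keeps the same fingerprint here, because line numbers are not part of the hashed text.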
Structured fix quality validation
Pullfrog's Fix mode runs tests and self-reviews before pushing. Warden's suggestedFix pipeline already validates that the diff applies. Extending this to semantic validation (does the fix actually address the finding?) and test awareness would improve fix acceptance rates.
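The layered validation described above could be sketched as an ordered gate that stops at the first failing check. `FixCheck`, `FixVerdict`, and `validateFix` are hypothetical names. The checks themselves (diff application, a semantic re-review, a test run) are passed in as callbacks, since how Warden would implement each one is not specified here.

```typescript
// Hypothetical types; each check wraps whatever Warden would actually run.
interface FixCheck {
  name: string; // e.g. "applies", "addresses-finding", "tests-pass"
  run: () => Promise<boolean>;
}

interface FixVerdict {
  accepted: boolean;
  failedCheck: string | null; // name of the first failing check, if any
}

// Run checks in order, cheapest first, and stop at the first failure.
// A check that rejects is treated as a failure rather than crashing the run.
async function validateFix(checks: FixCheck[]): Promise<FixVerdict> {
  for (const check of checks) {
    const ok = await check.run().catch(() => false);
    if (!ok) {
      return { accepted: false, failedCheck: check.name };
    }
  }
  return { accepted: true, failedCheck: null };
}
```

Ordering the checks from cheapest to most expensive means a fix that does not even apply never triggers a test run.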
Context: deep comparison documented in this Slack canvas.
Action taken on behalf of David Cramer.