🐧 Chinese version — penchan.co/ai/introducing-deep-review
From articles to architecture decisions
Mission: Give AI agents a research methodology — a structured way to evaluate external resources before deciding what to adopt. Not copying, but learning with discipline.
You read a great article. New tips, better workflows, smarter prompts. But should you actually change anything?
deep-review is a skill for Claude Code that answers this question. Instead of going with your gut, it runs each recommendation through a structured pipeline and gives you a clear verdict: adopt, experiment, reject, or needs discussion.
- Copy `deep-review.md` into your project or `~/.claude/skills` directory
- Say `deep-review` and paste the article
- Get a structured analysis with clear, actionable decisions
The skill file is just a structured prompt. You can adapt it for Cursor, Windsurf, or any AI assistant that reads markdown instructions.
We all do this:
- Read an exciting article
- Think "this is brilliant, I should use this"
- Either adopt everything (and bloat the system) or do nothing (and forget it)
The issue isn't the articles — it's that we skip the analysis. We get swayed by who wrote it, how new it sounds, or the urge to "do something." deep-review adds the thinking step you'd do if you had unlimited time and patience.
Six phases. Each one builds on the last.
```
article --> FILTER --> EXTRACT --> DIFF --> ARGUE --> DECIDE --..-> AUDIT --> result
              |                                                       ^
              +-- exit: not our problem                            subagent
```
| Phase | What happens |
|---|---|
| 0. Filter | "Do we even have this problem?" If not, stop here. |
| 1. Extract | Break the article into individual claims. Tag each one: data, case study, logic, or opinion. |
| 2. Diff | Compare each claim to what your system already does. Pull up the actual files. |
| 3. Argue | For each claim: the case for, the case against. Cost, risk, missing info. |
| 4. Decide | One decision card per claim. Adopt, experiment, reject, or flag for discussion. |
| 5. Audit | Independent check for blind spots — runs as a separate agent call to avoid self-review bias. |
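One way to picture what the pipeline produces: Phase 1 tags each claim with an evidence type, and Phase 4 wraps it in a decision card. A minimal sketch of that shape in Python (the class and field names are illustrative, not part of the skill):

```python
from dataclasses import dataclass
from enum import Enum

class Evidence(Enum):
    # Phase 1: how a claim is backed up
    DATA = "data"
    CASE_STUDY = "case study"
    LOGIC = "logic"
    OPINION = "opinion"

class Verdict(Enum):
    # Phase 4: the four possible outcomes
    ADOPT = "adopt"
    EXPERIMENT = "experiment"
    REJECT = "reject"
    DISCUSS = "needs discussion"

@dataclass
class Claim:
    text: str
    evidence: Evidence

@dataclass
class DecisionCard:
    claim: Claim
    case_for: str       # Phase 3: the case for
    case_against: str   # Phase 3: the case against
    verdict: Verdict

# Example card for one claim from a hypothetical article:
card = DecisionCard(
    claim=Claim("Prefer small, focused prompts", Evidence.OPINION),
    case_for="Easier to test one change at a time",
    case_against="More files to maintain",
    verdict=Verdict.EXPERIMENT,
)
```

Every claim gets its own card, so "adopt one tip, reject three" is the normal outcome rather than an all-or-nothing call on the whole article.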
When an AI evaluates its own output in the same breath, it almost always says "looks good." Research shows this kind of self-review has near-zero discriminative power. Running the audit as a separate call fixes this. Note that the audit is not strict fact-checking of the source article; deep-review's core stance is learning, not copying. Phase 5 audits the quality of the analysis itself: blind spots, biases, and hasty judgments.
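The two-call structure can be sketched like this, assuming a hypothetical `complete(prompt)` helper that sends one independent model call — it is a stand-in for whatever client you actually use, not a real API:

```python
def review(article: str, complete) -> tuple[str, str]:
    # First call: phases 0-4 produce the decision cards.
    cards = complete(f"Run deep-review phases 0-4 on:\n{article}")
    # Second, fresh call: the auditor sees only the finished cards,
    # never the reasoning that produced them, so it cannot simply
    # agree with its own earlier output.
    audit = complete(f"Audit these decision cards for blind spots:\n{cards}")
    return cards, audit
```

The point is the boundary, not the plumbing: the audit prompt starts from a clean context containing only the artifact to be checked.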
Why no role-play? Many prompts use personas like "Architect" and "Skeptic" debating each other. This doesn't actually work in a single generation — the AI can't reason independently for each role. We use structured questions instead.
Why no scores? Self-assigned scores (7/10, 85%) sound precise but are unreliable. The audit checks for specific failure patterns instead — like "all claims adopted" or "no counter-arguments given."
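A checklist audit reduces to a handful of yes/no predicates over the decision cards. A minimal sketch, assuming cards are dicts with `verdict` and `case_against` keys (an illustrative shape, not the skill's actual format):

```python
def audit_flags(cards: list[dict]) -> list[str]:
    """Return the failure patterns present in a set of decision cards."""
    flags = []
    # Pattern 1: everything adopted suggests the reviewer never pushed back.
    if cards and all(c["verdict"] == "adopt" for c in cards):
        flags.append("all claims adopted -- possible authority bias")
    # Pattern 2: a claim with no counter-argument means Phase 3 was skipped.
    if any(not c["case_against"].strip() for c in cards):
        flags.append("a claim has no counter-argument")
    return flags

# A suspicious review: everything adopted, one claim never challenged.
suspect = [
    {"verdict": "adopt", "case_against": "adds a dependency"},
    {"verdict": "adopt", "case_against": ""},
]
print(audit_flags(suspect))
```

Each flag is either present or absent — there is no 7/10 to argue about, which is exactly why checklists discriminate better than self-assigned scores.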
Why Phase 0? Most articles solve problems you don't have. Catching this early saves tokens and prevents unnecessary changes. "Do nothing" is a valid outcome.
- CheckEval — Why checklists beat open-ended scoring
- LLM-as-Judge research — Known biases and how to counter them
- Multi-agent debate studies — Why AI "debates" often make things worse
- Heilmeier Catechism — DARPA's method for vetting proposals
- Architecture Decision Records — How engineering teams document decisions that stick
- After each review, note what you actually adopted vs. skipped
- Every 5-10 reviews, look for patterns in misjudged claims
- Tweak the prompt — one change at a time, test it, keep or revert
- Track versions in the file header
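The version header need not be anything fancy — a comment block at the top of `deep-review.md` is enough. A possible format (the entries shown are made up for illustration):

```markdown
<!-- deep-review.md
     v1.1  tightened Phase 3 counter-argument questions
     v1.0  initial version
-->
```

One line per change makes it easy to see which tweak to revert when a review starts misjudging claims.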
This follows the autoresearch philosophy: small, measured improvements — not wholesale rewrites.
MIT