A systematic workflow and Python toolkit for verifying manuscript citations against their source PDFs. It uses a Claim-Grounded Extraction pipeline to minimize LLM hallucination and produces a comprehensive Excel report with citation accuracy verdicts, writing suggestions, and anticipated reviewer challenges.
Unlike generic RAG, this tool uses a pipeline designed specifically for citation fact-checking:
[claim sentence]
      |
Step 1: Extract verifiable assertions (rule-based)
        e.g., "sample size = 120", "effect size = 0.4", "p < 0.05"
      |
Step 2: Hybrid retrieval: BM25 + semantic + keyword search in the PDF
      |
Step 3: Send ONLY the retrieved passages to the LLM for a verdict,
        with a grounding requirement: "quote the exact sentence"
      |
Step 4: Failsafe: if no passage is found, verdict = "Not Found in PDF"
        (never let the LLM guess)
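Step 1 can be illustrated with a few regular expressions. This is a minimal sketch, not the toolkit's actual rule set: the `PATTERNS` table and the `extract_assertions` function are hypothetical names chosen for illustration, and the real rules in citation_factcheck.py may cover more cases.

```python
import re

# Hypothetical patterns for illustration only; the real rule set may differ.
PATTERNS = {
    "sample_size": re.compile(
        r"\b(?:n|sample size)\s*(?:=|of)\s*(\d+(?:,\d{3})*)", re.IGNORECASE
    ),
    "p_value": re.compile(r"\bp\s*([<>=≤≥])\s*(0?\.\d+)", re.IGNORECASE),
    "percentage": re.compile(r"(\d+(?:\.\d+)?)\s*%"),
}


def extract_assertions(claim: str) -> list:
    """Return every quantitative assertion matched by the rule table."""
    found = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(claim):
            found.append({"kind": kind, "text": match.group(0)})
    return found


claim = "In a sample size of 120, mortality fell by 15% (p < 0.05) [3]."
for item in extract_assertions(claim):
    print(item["kind"], "->", item["text"])
```

Because the extracted strings are literal, they can be passed directly to a keyword or regex search over the PDF text in Step 2.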
Key benefits:
- Hallucination is contained: the LLM sees only the relevant passages, not full-text dumps
- Numbers and statistics are found precisely: BM25 + regex outperforms semantic embeddings on quantitative claims
- Hybrid retrieval catches paraphrases: semantic search finds "mortality decreased" when the claim says "survival improved"
- Failsafe prevents fabrication: if nothing is found, the system says so
- Cost-efficient: cheap keyword search runs first, and the expensive LLM sees only the retrieved chunks
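The grounding and failsafe behavior described above can be sketched as follows. This is an illustrative outline under stated assumptions: `build_prompt`, `verify_claim`, and the `ask_llm` callable are hypothetical names, and the prompt wording is invented for the example rather than taken from WORKFLOW.md.

```python
from typing import Callable

NOT_FOUND = "Not Found in PDF"


def build_prompt(claim: str, passages: list) -> str:
    """Build a prompt that contains only the retrieved passages."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Judge the claim using ONLY the passages below.\n"
        "Quote the exact sentence that supports your verdict.\n\n"
        f"Claim: {claim}\n\nPassages:\n{numbered}"
    )


def verify_claim(claim: str, passages: list, ask_llm: Callable[[str], str]) -> dict:
    """Return a verdict; skip the LLM entirely when retrieval found nothing."""
    if not passages:
        # Failsafe: no evidence means no LLM call, so nothing can be invented.
        return {"verdict": NOT_FOUND, "evidence": []}
    return {"verdict": ask_llm(build_prompt(claim, passages)), "evidence": passages}


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    return "Correct"


print(verify_claim("Mortality fell by 15%", [], fake_llm))
print(verify_claim("Mortality fell by 15%", ["Mortality fell by 15%."], fake_llm))
```

Returning early keeps the "Not Found in PDF" verdict deterministic: it depends only on retrieval, never on model output.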
Given a manuscript (.docx) and its reference PDFs (paper_1.pdf ... paper_N.pdf), this tool:
- Extracts all in-text citations and maps them to reference entries
- Structures each PDF by detecting sections (Abstract, Results, Conclusions, etc.)
- Decomposes compound claims ("A, B, and C [ref]") into individual verifiable assertions
- Retrieves relevant passages using hybrid BM25 + semantic + keyword search
- Verifies whether each claim is supported, with a failsafe for missing evidence
- Generates a multi-sheet Excel report containing:
  - Per-section citation fact-checks (Introduction, Methods, Discussion, etc.)
  - Manuscript structural/formatting errors
  - Writing improvement suggestions with recommended additional references
  - Anticipated reviewer challenges with severity ratings and suggested responses
  - Color-coded verdict legend
| Verdict | Meaning |
|---|---|
| Correct | Reference accurately supports the stated claim |
| Modify Text | Reference is real, but the claim overstates or misrepresents its findings |
| Weak Ref / Replace | Reference is only tangentially relevant (protocol paper, wrong population, etc.) |
| Wrong Citation / Wrong File | PDF mismatch, wrong file, or the paper does not support the claim at all |
| Not Found in PDF | No relevant passage located; the claim cannot be verified against this source |
| Manuscript Error | Structural, formatting, or completeness issue (not citation content) |
pip install -r requirements.txt

Optional dependencies for enhanced features:

# Semantic search (local)
pip install sentence-transformers numpy
# CJK tokenization (Chinese/Japanese/Korean)
pip install jieba

Citation_FactCheck/
  citation_factcheck.py
  manuscript/          <- put your .docx here
    Manuscript.docx
  ref_pdf/             <- put your reference PDFs here
    paper_1.pdf
    paper_2.pdf
    ...
    paper_N.pdf

- PDF naming: paper_X.pdf, where X matches the reference number in the manuscript
- Manuscript: a standard .docx with numbered in-text citations
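The naming convention makes the mapping from an in-text citation to its PDF mechanical. The sketch below assumes square-bracket citations such as "[1, 3-5]"; that bracket style, the `CITATION` pattern, and both function names are assumptions made for illustration, since the manuscript format is only specified as "numbered in-text citations".

```python
import re
from pathlib import Path

# Assumed citation style: [1], [2, 4], [3-5]; adjust for other numbering styles.
CITATION = re.compile(r"\[(\d+(?:\s*[-–,]\s*\d+)*)\]")


def parse_citation_numbers(sentence: str) -> list:
    """Expand every bracketed citation in a sentence into reference numbers."""
    numbers = []
    for group in CITATION.findall(sentence):
        for part in re.split(r"\s*,\s*", group):
            bounds = re.split(r"\s*[-–]\s*", part)
            start, end = int(bounds[0]), int(bounds[-1])
            numbers.extend(range(start, end + 1))  # "3-5" becomes 3, 4, 5
    return sorted(set(numbers))


def pdf_path_for(ref_number: int, pdf_dir: str = "ref_pdf") -> Path:
    """Map reference number X to <pdf_dir>/paper_X.pdf."""
    return Path(pdf_dir) / f"paper_{ref_number}.pdf"


for ref in parse_citation_numbers("Prior trials [1, 3-5] disagree with [7]."):
    print(ref, pdf_path_for(ref).as_posix())
```

A reference number whose paper_X.pdf file is missing is a natural candidate for the "Wrong Citation / Wrong File" or "Manuscript Error" verdicts.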
# Open a terminal in the project folder, then:
claude
# In Claude Code, paste the prompt from WORKFLOW.md

Claude Code will use the Claim-Grounded Extraction pipeline to:
- Extract verifiable assertions from each citation claim
- Search PDFs using hybrid retrieval (BM25 + semantic + keywords)
- Decompose compound claims for independent verification
- Generate verdicts only from retrieved passages
- Apply the failsafe to unverifiable claims
# Basic (BM25 only)
python citation_factcheck.py \
--manuscript manuscript/Manuscript.docx \
--pdf-dir ref_pdf \
--output Citation_FactCheck_Report.xlsx
# With semantic search via OpenRouter
python citation_factcheck.py \
--manuscript manuscript/Manuscript.docx \
--pdf-dir ref_pdf \
--output Citation_FactCheck_Report.xlsx \
--openrouter-key YOUR_API_KEY
# Or set the environment variable
export OPENROUTER_API_KEY=your_key_here
python citation_factcheck.py \
--manuscript manuscript/Manuscript.docx \
--pdf-dir ref_pdf \
--output Citation_FactCheck_Report.xlsx

pip install -r requirements.txt

Required:
| Package | Purpose |
|---|---|
| python-docx | Read .docx manuscript files |
| PyMuPDF (fitz) | Extract text from PDFs with section detection |
| openpyxl | Generate formatted Excel reports |
| jieba | CJK (Chinese) word segmentation |

Optional:

| Package | Purpose |
|---|---|
| sentence-transformers | Local semantic search (no API needed) |
| numpy | Required by sentence-transformers |
The hybrid retrieval system tries the following semantic backends in order and falls back to BM25 alone when none is available:
1. OpenRouter API (--openrouter-key or OPENROUTER_API_KEY)
   └─ Uses text-embedding-3-small, requires an API key
2. sentence-transformers (local)
   └─ Uses all-MiniLM-L6-v2 (~80MB), no API needed
3. BM25 only (always available)
   └─ Keyword-based retrieval, no extra dependencies
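The priority order can be sketched as a simple selection function. This is a minimal sketch of the idea, assuming selection depends only on whether a key or the package is present; `choose_backend` and the returned labels are hypothetical, and the actual tool may also fall back at runtime if a backend fails.

```python
import importlib.util
import os
from typing import Optional


def choose_backend(openrouter_key: Optional[str] = None) -> str:
    """Pick the first available retrieval backend in priority order."""
    key = openrouter_key or os.environ.get("OPENROUTER_API_KEY")
    if key:
        return "openrouter"  # 1. remote embeddings via the OpenRouter API
    if importlib.util.find_spec("sentence_transformers") is not None:
        return "sentence-transformers"  # 2. local embedding model
    return "bm25"  # 3. keyword-only retrieval, always available


print(choose_backend())
```

Checking for the package with `find_spec` avoids importing a large model library just to find out whether it is installed.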
Hybrid retrieval combines keyword-based BM25 with semantic similarity using Reciprocal Rank Fusion (RRF), which scores each passage by summing 1/(k + rank) over the individual rankings. This catches paraphrased content that BM25 alone would miss (e.g., "mortality decreased" vs. "survival improved").
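A minimal sketch of Reciprocal Rank Fusion follows. The constant k = 60 is the value commonly used in the literature and is an assumption here, as are the function name and the passage IDs; the toolkit's own constant and data structures are not documented in this README.

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse several ranked lists of passage IDs into one scored ranking.

    Each passage receives 1 / (k + rank) from every list it appears in,
    with rank counted from 1. Higher fused scores rank first.
    """
    scores = {}
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] = scores.get(passage_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


bm25_ranking = ["p3", "p1", "p7"]      # best keyword matches first
semantic_ranking = ["p1", "p9", "p3"]  # best embedding matches first

for passage_id, score in reciprocal_rank_fusion([bm25_ranking, semantic_ranking]):
    print(passage_id, round(score, 5))
```

Passages that appear in both lists (p1 and p3 here) rise above passages that only one ranker found, without needing to put BM25 scores and cosine similarities on a common scale.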
The tokenizer auto-detects CJK (Chinese/Japanese/Korean) content and switches strategies:
- English: regex-based tokenizer with stopword removal
- CJK: jieba word segmentation (falls back to character-level tokens if jieba is not installed)
- Mixed text: dual tokenization for both the CJK and English parts
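The three tokenization cases can be sketched as below. The Unicode ranges, the tiny stopword set, and all function names are assumptions for illustration; only `jieba.lcut` is an actual library call, and the sketch falls back to single characters when jieba is not installed.

```python
import re

# Assumed ranges: CJK ideographs, Japanese kana, Hangul syllables.
CJK_RUN = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]+")
LATIN_WORD = re.compile(r"[a-z0-9]+(?:\.[0-9]+)?")
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "was"}  # illustrative


def contains_cjk(text: str) -> bool:
    return CJK_RUN.search(text) is not None


def segment_cjk(run: str) -> list:
    """Segment a CJK run with jieba, or character by character without it."""
    try:
        import jieba
    except ImportError:
        return list(run)
    return [tok for tok in jieba.lcut(run) if tok.strip()]


def tokenize(text: str) -> list:
    """Tokenize English and CJK parts separately, then merge the tokens.

    Token order is Latin first, then CJK, which is fine for bag-of-words
    scoring such as BM25 but does not preserve the original word order.
    """
    tokens = [w for w in LATIN_WORD.findall(text.lower()) if w not in STOPWORDS]
    for run in CJK_RUN.findall(text):
        tokens.extend(segment_cjk(run))
    return tokens


print(tokenize("The mortality was 0.4"))
print(tokenize("mortality 死亡率降低"))
```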
Compound claims are automatically split into individual verifiable assertions:
Input: "A increased mortality, B reduced infection, and C improved outcomes [1]"
Output: ["A increased mortality", "B reduced infection", "C improved outcomes"]
Each sub-claim is verified independently against the PDF.
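A naive version of this split can be written with two regular expressions. This is a sketch rather than the toolkit's decomposition logic: `decompose_claim` and both patterns are hypothetical, and splitting on every "and" would wrongly break phrases such as "morbidity and mortality", which a real implementation has to guard against.

```python
import re

# Strips trailing markers such as [1] or [2, 4-6] (assumed citation style).
CITATION_MARK = re.compile(r"\s*\[\d+(?:\s*[-–,]\s*\d+)*\]")
# Splits on commas (optionally followed by "and") or on a bare "and".
SEPARATOR = re.compile(r",\s*(?:and\s+)?|\s+and\s+")


def decompose_claim(sentence: str) -> list:
    """Split a compound claim into sub-claims, dropping the citation marker."""
    body = CITATION_MARK.sub("", sentence).strip().rstrip(".")
    parts = [part.strip() for part in SEPARATOR.split(body)]
    return [part for part in parts if part]


claim = "A increased mortality, B reduced infection, and C improved outcomes [1]"
print(decompose_claim(claim))
```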
Claims are grouped by reference number so that the same PDF context is reused across multiple claims. When used with Claude Code, this enables prompt caching, for roughly 90% cost savings after the first claim per reference.
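The grouping step itself is small. In the sketch below, the claim dictionaries with `text` and `refs` keys are an assumed data shape, and `group_by_reference` is a hypothetical name; the point is only that all claims citing the same reference end up in one batch, so the PDF context is loaded once per batch.

```python
from collections import defaultdict


def group_by_reference(claims: list) -> dict:
    """Map each reference number to the list of claim texts that cite it."""
    grouped = defaultdict(list)
    for claim in claims:
        for ref in claim["refs"]:
            grouped[ref].append(claim["text"])
    return dict(sorted(grouped.items()))


claims = [
    {"text": "A increased mortality", "refs": [1]},
    {"text": "B reduced infection", "refs": [2]},
    {"text": "C improved outcomes", "refs": [1, 2]},
]

for ref, texts in group_by_reference(claims).items():
    # Load the context for paper_<ref>.pdf once, then verify every claim in `texts`.
    print(ref, texts)
```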
| Option | Default | Description |
|---|---|---|
| --manuscript, -m | (required) | Path to the .docx manuscript |
| --pdf-dir, -d | ref_pdf | Directory containing the paper_X.pdf files |
| --output, -o | Citation_FactCheck_Report.xlsx | Output Excel path |
| --pages, -p | 10 | Max pages per PDF |
| --scaffold-only | false | Only run the mapping report |
| --openrouter-key | $OPENROUTER_API_KEY | OpenRouter API key for semantic search |
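For readers who want to wrap or script the tool, the option table corresponds to an `argparse` declaration along these lines. This is a reconstruction from the table only, not the parser in citation_factcheck.py; the `build_parser` name and the help strings are assumptions.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented options with their documented defaults."""
    parser = argparse.ArgumentParser(
        description="Verify manuscript citations against source PDFs."
    )
    parser.add_argument("--manuscript", "-m", required=True,
                        help="Path to the .docx manuscript")
    parser.add_argument("--pdf-dir", "-d", default="ref_pdf",
                        help="Directory containing the paper_X.pdf files")
    parser.add_argument("--output", "-o", default="Citation_FactCheck_Report.xlsx",
                        help="Output Excel path")
    parser.add_argument("--pages", "-p", type=int, default=10,
                        help="Max pages per PDF")
    parser.add_argument("--scaffold-only", action="store_true",
                        help="Only run the mapping report")
    parser.add_argument("--openrouter-key",
                        default=os.environ.get("OPENROUTER_API_KEY"),
                        help="OpenRouter API key for semantic search")
    return parser


args = build_parser().parse_args(["-m", "manuscript/Manuscript.docx", "--pages", "5"])
print(args.manuscript, args.pdf_dir, args.pages, args.scaffold_only)
```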
| Column | Description |
|---|---|
| # | Row number |
| Section / Para | Location in the manuscript |
| Sentence Cited | Exact quoted text containing the citation |
| Ref(s) Checked | Citation number(s) |
| Full Reference | Complete bibliographic entry |
| Claim Being Made | What the author attributes to this reference |
| Verdict | Accuracy rating with color coding |
| PDF Evidence / Finding | What the PDF actually says |
| Recommendation | Specific action (replace ref, reword claim, etc.) |
The manuscript errors sheet lists non-citation issues: formatting problems, missing sections, draft notes, reference numbering gaps, and file naming errors.
The writing suggestions sheet lists areas where the argumentation could be strengthened, with specific additional references to consider.
The reviewer challenges sheet lists anticipated reviewer criticisms rated by severity (CRITICAL / HIGH / MEDIUM), with explanations of why reviewers are likely to raise each point and suggested responses or preemptive revisions.
The legend sheet explains the color coding for all verdict types.
See WORKFLOW.md for the complete step-by-step standard operating procedure, including the Claude Code prompt template for automated execution.
MIT