
YHHuan/Citation_FactCheck


Citation FactCheck

A systematic workflow and Python toolkit for verifying manuscript citations against their source PDFs. It uses a Claim-Grounded Extraction pipeline to minimize LLM hallucination and produces a multi-sheet Excel report with citation accuracy verdicts, writing suggestions, and anticipated reviewer challenges.


Algorithm: Claim-Grounded Extraction

Unlike generic RAG, this tool uses a pipeline designed specifically for citation fact-checking:

[claim sentence]
       |
  Step 1: Extract verifiable assertions (rule-based)
          e.g., "sample size = 120", "effect size = 0.4", "p < 0.05"
       |
  Step 2: Hybrid retrieval: BM25 + semantic + keyword search in PDF
       |
  Step 3: Send ONLY retrieved passages to LLM for verdict
          with grounding requirement: "quote the exact sentence"
       |
  Step 4: Failsafe: if no passage found -> verdict = "Not Found in PDF"
          (never let the LLM guess)
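
Steps 1 and 4 of the pipeline can be sketched roughly as follows. This is a minimal illustration, not the toolkit's actual code: the regex patterns, `extract_assertions`, and `verify_claim` are hypothetical names, and `retrieve` and `judge` stand in for the retrieval and LLM steps.

```python
import re
from typing import Callable, List

# Hypothetical patterns for quantitative assertions; the real rule set
# is not documented in this README.
ASSERTION_PATTERNS = [
    re.compile(r"\b(?:sample size|n)\s*=\s*\d+", re.IGNORECASE),
    re.compile(r"\beffect size\s*=\s*\d+(?:\.\d+)?", re.IGNORECASE),
    re.compile(r"\bp\s*[<>=]\s*0?\.\d+", re.IGNORECASE),
]


def extract_assertions(claim: str) -> List[str]:
    """Step 1: collect verifiable numeric assertions, pattern by pattern."""
    found: List[str] = []
    for pattern in ASSERTION_PATTERNS:
        found.extend(m.group(0) for m in pattern.finditer(claim))
    return found


def verify_claim(
    claim: str,
    retrieve: Callable[[str], List[str]],
    judge: Callable[[str, List[str]], str],
) -> str:
    """Steps 2-4: retrieve passages, then judge; never judge without evidence."""
    passages = retrieve(claim)
    if not passages:
        # Failsafe: the judge (LLM) is not called at all.
        return "Not Found in PDF"
    return judge(claim, passages)
```

The point of the design is that the failsafe branch returns before the judge is ever invoked, so an empty retrieval cannot turn into a guessed verdict.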

Key benefits:

  • Hallucination is contained: the LLM only sees relevant passages, not full-text dumps.
  • Numbers and statistics are found precisely: BM25 + regex beats semantic embeddings for quantitative claims.
  • Hybrid retrieval catches paraphrases: semantic search finds "mortality decreased" when the claim says "survival improved".
  • Failsafe prevents fabrication: if nothing is found, the system reports that instead of guessing.
  • Cost-efficient: cheap keyword search runs first, and the expensive LLM only processes the retrieved chunks.
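
To illustrate why keyword scoring works well for exact numbers, here is a minimal, dependency-free BM25 scorer. It is a sketch of the standard Okapi BM25 formula under assumed parameters (`k1 = 1.5`, `b = 0.75`); the function name and defaults are illustrative and not taken from the toolkit.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query: List[str],
    docs: List[List[str]],
    k1: float = 1.5,
    b: float = 0.75,
) -> List[float]:
    """Score each tokenized passage against the tokenized query."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many passages each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

A query token such as "120" only scores passages that contain that exact token, which is the behavior the quantitative-claim benefit relies on.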

What It Does

Given a manuscript (.docx) and its reference PDFs (paper_1.pdf ... paper_N.pdf), this tool:

  1. Extracts all in-text citations and maps them to reference entries
  2. Structures each PDF by detecting sections (Abstract, Results, Conclusions, etc.)
  3. Decomposes compound claims ("A, B, and C [ref]") into individual verifiable assertions
  4. Retrieves relevant passages using hybrid BM25 + semantic + keyword search
  5. Verifies whether each claim is supported, with a failsafe for missing evidence
  6. Generates a multi-sheet Excel report containing:
    • Per-section citation fact-checks (Introduction, Methods, Discussion, etc.)
    • Manuscript structural/formatting errors
    • Writing improvement suggestions with recommended additional references
    • Anticipated reviewer challenges with severity ratings and suggested responses
    • Color-coded verdict legend
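
Step 1 (citation extraction) might look like the sketch below. It assumes bracketed numeric citations such as `[1]`, `[2, 5]`, or `[4-6]`; the README only says citations are numbered, so the bracket style, `CITATION_RE`, and `parse_citation_numbers` are assumptions for illustration.

```python
import re
from typing import List

# Assumed citation style: [1], [2, 5], [4-6], or combinations like [1, 4-6].
CITATION_RE = re.compile(r"\[(\d+(?:\s*[-–,]\s*\d+)*)\]")


def parse_citation_numbers(sentence: str) -> List[int]:
    """Return every reference number cited in the sentence, expanding ranges."""
    numbers: List[int] = []
    for match in CITATION_RE.finditer(sentence):
        for part in re.split(r"\s*,\s*", match.group(1)):
            bounds = [int(x) for x in re.split(r"\s*[-–]\s*", part)]
            if len(bounds) > 1:
                # A range such as 4-6 expands to 4, 5, 6.
                numbers.extend(range(bounds[0], bounds[-1] + 1))
            else:
                numbers.append(bounds[0])
    return numbers
```

Each number returned can then be looked up in the reference list and matched to the corresponding paper_X.pdf.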

Verdict System

| Verdict | Meaning |
|---|---|
| Correct | Reference accurately supports the stated claim |
| Modify Text | Reference is real but the claim overstates or misrepresents its findings |
| Weak Ref / Replace | Reference is only tangentially relevant (protocol paper, wrong population, etc.) |
| Wrong Citation / Wrong File | PDF mismatch, wrong file, or the paper doesn't support the claim at all |
| Not Found in PDF | No relevant passage located; the claim cannot be verified against this source |
| Manuscript Error | Structural, formatting, or completeness issue (not citation content) |

Quick Start

1. Install dependencies

pip install -r requirements.txt

Optional dependencies for enhanced features:

# Semantic search (local)
pip install sentence-transformers numpy

# CJK tokenization (Chinese/Japanese/Korean)
pip install jieba

2. Prepare your files

Citation_FactCheck/
  citation_factcheck.py
  manuscript/           <- put your .docx here
    Manuscript.docx
  ref_pdf/              <- put your reference PDFs here
    paper_1.pdf
    paper_2.pdf
    ...
    paper_N.pdf

  • PDF naming: paper_X.pdf, where X matches the reference number in the manuscript
  • Manuscript: standard .docx with numbered in-text citations
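
The naming convention above implies a simple lookup from reference number to file. A minimal sketch, assuming the `paper_X.pdf` pattern is matched literally (`PDF_NAME_RE` and `map_reference_pdfs` are hypothetical names, not the script's API):

```python
import re
from pathlib import Path
from typing import Dict, Iterable

PDF_NAME_RE = re.compile(r"^paper_(\d+)\.pdf$", re.IGNORECASE)


def map_reference_pdfs(filenames: Iterable[str]) -> Dict[int, str]:
    """Map reference number X to its paper_X.pdf path; ignore other files."""
    mapping: Dict[int, str] = {}
    for name in filenames:
        match = PDF_NAME_RE.match(Path(name).name)
        if match:
            mapping[int(match.group(1))] = name
    return mapping
```

In practice the filenames would come from something like `Path("ref_pdf").glob("*.pdf")`; a cited number missing from the mapping is a candidate for the "Wrong Citation / Wrong File" or "Manuscript Error" categories.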

3. Run with Claude Code (recommended)

# Open a terminal in the project folder, then:
claude

# In Claude Code, paste the prompt from WORKFLOW.md

Claude Code will use the Claim-Grounded Extraction pipeline to:

  • Extract verifiable assertions from each citation claim
  • Search PDFs using hybrid retrieval (BM25 + semantic + keywords)
  • Decompose compound claims for independent verification
  • Generate verdicts only from retrieved passages
  • Apply the failsafe for unverifiable claims

4. Or run the scaffold independently

# Basic (BM25 only)
python citation_factcheck.py \
  --manuscript manuscript/Manuscript.docx \
  --pdf-dir ref_pdf \
  --output Citation_FactCheck_Report.xlsx

# With semantic search via OpenRouter
python citation_factcheck.py \
  --manuscript manuscript/Manuscript.docx \
  --pdf-dir ref_pdf \
  --output Citation_FactCheck_Report.xlsx \
  --openrouter-key YOUR_API_KEY

# Or set the environment variable instead of passing the key
export OPENROUTER_API_KEY=your_key_here
python citation_factcheck.py \
  --manuscript manuscript/Manuscript.docx \
  --pdf-dir ref_pdf \
  --output Citation_FactCheck_Report.xlsx

Installation

pip install -r requirements.txt

Dependencies

Required:

| Package | Purpose |
|---|---|
| python-docx | Read .docx manuscript files |
| PyMuPDF (fitz) | Extract text from PDFs with section detection |
| openpyxl | Generate formatted Excel reports |

Optional:

| Package | Purpose |
|---|---|
| sentence-transformers | Local semantic search (no API needed) |
| numpy | Required by sentence-transformers |
| jieba | CJK (Chinese) word segmentation; the tokenizer falls back to character-level if it is not installed |

Semantic Search Fallback Chain

The hybrid retrieval system tries the following strategies in order:

1. OpenRouter API (--openrouter-key or OPENROUTER_API_KEY)
   └─ Uses text-embedding-3-small, requires API key

2. sentence-transformers (local)
   └─ Uses all-MiniLM-L6-v2 (~80MB), no API needed

3. BM25 only (always available)
   └─ Keyword-based retrieval, no extra dependencies
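
The selection order above could be expressed as a small function like this one. It is a sketch of the decision logic only; `choose_retrieval_backend` and its return labels are hypothetical, and the `local_available` override exists purely to make the behavior easy to test.

```python
import importlib.util
import os
from typing import Optional


def choose_retrieval_backend(
    api_key: Optional[str] = None,
    local_available: Optional[bool] = None,
) -> str:
    """Pick the first usable strategy: OpenRouter, local model, then BM25."""
    key = api_key or os.environ.get("OPENROUTER_API_KEY")
    if key:
        return "openrouter"
    if local_available is None:
        # Check whether sentence-transformers is importable without importing it.
        local_available = importlib.util.find_spec("sentence_transformers") is not None
    if local_available:
        return "sentence-transformers"
    return "bm25"
```

Because BM25 needs no extra packages, the chain always ends in a working retriever even on a bare install.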

Features

Hybrid Retrieval (BM25 + Semantic)

Combines keyword-based BM25 with semantic similarity, merging the two rankings with Reciprocal Rank Fusion (RRF). This catches paraphrased content that BM25 alone would miss (e.g., "mortality decreased" vs. "survival improved").
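
Reciprocal Rank Fusion scores each passage by summing 1 / (k + rank) over every ranking it appears in. The sketch below uses k = 60, a commonly used constant; the README does not say which value the toolkit uses, so treat both the constant and the function name as assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of passage IDs into one ranked list."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_ranking = ["p3", "p1", "p2"]
semantic_ranking = ["p1", "p4", "p3"]
fused = reciprocal_rank_fusion([bm25_ranking, semantic_ranking])
# fused == ["p1", "p3", "p4", "p2"]
```

RRF only needs ranks, not raw scores, so BM25 scores and cosine similarities never have to be normalized onto a common scale.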

Multi-Language Support

Auto-detects CJK (Chinese/Japanese/Korean) content and switches tokenizers:

  • English: regex-based tokenizer with stopword removal
  • CJK: jieba word segmentation (falls back to character-level if jieba is not installed)
  • Mixed text: dual tokenization for both the CJK and English parts
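
A rough sketch of the tokenizer switching described above. The Unicode ranges, the tiny stopword list, and the function names are illustrative assumptions; only the `jieba.lcut` call and the character-level fallback reflect behavior the README describes.

```python
import re
from typing import List

# Assumed ranges: CJK ideographs, Japanese kana, Hangul syllables.
CJK_RUN_RE = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]+")
WORD_RE = re.compile(r"[a-z0-9]+(?:\.[0-9]+)?")
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "was"}  # illustrative subset


def contains_cjk(text: str) -> bool:
    return CJK_RUN_RE.search(text) is not None


def tokenize(text: str) -> List[str]:
    """Tokenize English parts by regex and CJK parts by jieba (or per character)."""
    tokens = [t for t in WORD_RE.findall(text.lower()) if t not in STOPWORDS]
    for run in CJK_RUN_RE.findall(text):
        try:
            import jieba  # optional dependency
            tokens.extend(jieba.lcut(run))
        except ImportError:
            # Fallback: one token per character.
            tokens.extend(run)
    return tokens
```

Mixed text passes through both branches, so a sentence containing English statistics and Chinese prose yields tokens from each.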

Claim Decomposition

Automatically splits compound claims into individual verifiable assertions:

Input:  "A increased mortality, B reduced infection, and C improved outcomes [1]"
Output: ["A increased mortality", "B reduced infection", "C improved outcomes"]

Each sub-claim is verified independently against the PDF.
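
A naive version of this split can be written with two regular expressions. This is only a sketch that reproduces the example above: it assumes bracketed citation markers and splits on every comma and "and", so it would also break phrases like "men and women"; the toolkit's real rules are presumably more careful.

```python
import re
from typing import List

# Assumed citation marker style: [1], [2, 5], [4-6].
CITATION_MARK_RE = re.compile(r"\s*\[\d+(?:\s*[-–,]\s*\d+)*\]")


def decompose_claim(sentence: str) -> List[str]:
    """Strip citation markers, then split on commas and 'and'."""
    text = CITATION_MARK_RE.sub("", sentence).strip().rstrip(".")
    parts = re.split(r",\s*(?:and\s+)?|\s+and\s+", text)
    return [p.strip() for p in parts if p.strip()]


sub_claims = decompose_claim(
    "A increased mortality, B reduced infection, and C improved outcomes [1]"
)
# sub_claims == ["A increased mortality", "B reduced infection", "C improved outcomes"]
```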

Batch Processing with Prompt Caching

Groups claims by reference number so that the same PDF context is reused across multiple claims. When used with Claude Code, this enables prompt caching, for roughly 90% cost savings on every claim after the first one per reference.
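
The grouping step itself is straightforward. A minimal sketch, assuming each claim arrives as a `(reference_number, claim_text)` pair (a hypothetical shape, not the script's data model); prompt caching is handled by the LLM client and is not shown here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_claims_by_reference(claims: Iterable[Tuple[int, str]]) -> Dict[int, List[str]]:
    """Collect claims per reference so each PDF context is loaded once."""
    grouped: Dict[int, List[str]] = defaultdict(list)
    for ref_number, claim in claims:
        grouped[ref_number].append(claim)
    # Sort by reference number for a stable processing order.
    return dict(sorted(grouped.items()))
```

Processing all claims for one reference back to back keeps the shared PDF passages at the start of consecutive prompts, which is the condition under which a cached prefix can be reused.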

CLI Options

| Option | Default | Description |
|---|---|---|
| --manuscript, -m | (required) | Path to the .docx manuscript |
| --pdf-dir, -d | ref_pdf | Directory with paper_X.pdf files |
| --output, -o | Citation_FactCheck_Report.xlsx | Output Excel path |
| --pages, -p | 10 | Max pages per PDF |
| --scaffold-only | false | Only run the mapping report |
| --openrouter-key | $OPENROUTER_API_KEY | OpenRouter API key for semantic search |

Output Format

Sheet: Introduction / Methods / Discussion (per-section)

| Column | Description |
|---|---|
| # | Row number |
| Section / Para | Location in manuscript |
| Sentence Cited | Exact quoted text containing the citation |
| Ref(s) Checked | Citation number(s) |
| Full Reference | Complete bibliographic entry |
| Claim Being Made | What the author attributes to this reference |
| Verdict | Accuracy rating with color coding |
| PDF Evidence / Finding | What the PDF actually says |
| Recommendation | Specific action (replace ref, reword claim, etc.) |

Sheet: Manuscript Errors

Non-citation issues: formatting problems, missing sections, draft notes, reference numbering gaps, file naming errors.

Sheet: Writing & Ref Suggestions

Areas where the argumentation could be strengthened, with specific additional references to consider.

Sheet: Reviewer Challenges

Anticipated reviewer criticisms rated by severity (CRITICAL / HIGH / MEDIUM), each with an explanation of why reviewers are likely to raise the point and a suggested response or preemptive revision.

Sheet: Legend

Color coding explanation for all verdict types.

Detailed Workflow

See WORKFLOW.md for the complete step-by-step standard operating procedure, including the Claude Code prompt template for automated execution.

License

MIT
