RoshRaj01/Manual_Paper_Anonymizer

📄 Paper Anonymizer

A web-based tool to anonymize research papers by removing author names, affiliations, and other identifying information before peer review.


🚀 Features

  • 📂 Select input and output folders
  • 📄 Supports PDF, DOC, DOCX files
  • 🔄 Automatic DOC/DOCX → PDF conversion using Microsoft Word COM
  • 🖱️ Interactive PDF viewer with drag-to-select redaction
  • ✂️ Remove selected regions precisely (word-level redaction)
  • 👁️ Preview removals before saving
  • ↩️ Undo applied redactions
  • 💾 Save anonymized files to output folder
  • 🔗 Merge acknowledgement document (optional)
  • 🧹 Removes PDF metadata for full anonymization

🏗️ Project Structure

Anonymizer-Paper/
│
├── backend/
│   ├── app.py              # FastAPI backend
│   ├── converter.py        # DOC/DOCX → PDF conversion (Word COM)
│   ├── pdf_editor.py       # Redaction + metadata removal
│   ├── utils.py            # File utilities
│   ├── requirements.txt    # Dependencies
│   └── temp/               # Temporary converted PDFs
│
├── input/                  # Input papers
├── output/                 # Anonymized papers
│
├── app.js                  # Frontend logic
├── index.html              # UI
├── style.css               # Styling

.
├── app.py              # FastAPI backend
├── converter.py        # DOC/DOCX → PDF conversion (Word COM)
├── pdf_editor.py       # Redaction + metadata removal
├── utils.py            # File utilities
├── requirements.txt    # Dependencies
│
├── index.html          # Frontend UI
├── app.js              # Frontend logic
├── style.css           # UI styling
│
├── input/              # Input papers
├── output/             # Anonymized papers
├── temp/               # Temporary converted PDFs


⚙️ Setup Instructions

1. Clone / Download

git clone <repo-url>
cd paper-anonymizer

2. Create Virtual Environment

python -m venv .venv
.venv\Scripts\activate   # Windows

3. Install Dependencies

pip install -r requirements.txt

Dependencies include:

  • FastAPI
  • PyMuPDF
  • pywin32 (for Word conversion)

4. Enable Word COM (IMPORTANT)

python -m win32com.client.makepy

Then select:

Microsoft Word XX.X Object Library

⚠️ Requires:

  • Windows OS
  • Microsoft Word installed

5. Run Backend Server

uvicorn app:app --reload

Server runs at:

http://localhost:8000

6. Open Frontend

Open:

index.html

Or run via Live Server:

http://localhost:63342/.../index.html

🧠 How It Works

1. File Loading

  • Backend lists files using list_files()
  • DOC/DOCX files are converted using Word COM
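The file-loading step can be sketched roughly as follows. This is an illustrative sketch, not the project's actual code: only the name list_files() comes from this README, while its signature, the convert_to_pdf helper, and the SUPPORTED set are assumptions. The conversion relies on Word's wdFormatPDF constant (17) and runs only on Windows with Microsoft Word installed.

```python
from pathlib import Path

SUPPORTED = {".pdf", ".doc", ".docx"}  # extensions named in the feature list
WD_FORMAT_PDF = 17                     # Word's wdFormatPDF save format


def list_files(folder):
    """Recursively collect supported papers under `folder`, sorted by path.

    Hypothetical signature; the real list_files() may differ.
    """
    root = Path(folder)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )


def convert_to_pdf(src, temp_dir):
    """Convert a DOC/DOCX file to PDF through Word COM (hypothetical helper)."""
    import win32com.client  # imported lazily so list_files works on any OS

    src = Path(src).resolve()
    out = Path(temp_dir).resolve() / (src.stem + ".pdf")
    out.parent.mkdir(parents=True, exist_ok=True)

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        doc = word.Documents.Open(str(src), ReadOnly=True)
        try:
            doc.SaveAs(str(out), FileFormat=WD_FORMAT_PDF)
        finally:
            doc.Close(False)  # close without saving changes to the source
    finally:
        word.Quit()
    return out
```

Word COM expects absolute paths, which is why both paths are resolved before the document is opened.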

2. PDF Rendering

  • Uses PDF.js to render pages in browser
  • Text layer enables accurate selection

3. Selection System

  • User drags to select regions
  • Coordinates are converted to PDF space
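As a rough illustration of the coordinate conversion, the sketch below maps a drag rectangle from canvas pixels to PDF points. It assumes an unrotated page rendered at a single scale factor, with both coordinate systems measured from the top-left corner (as PyMuPDF page coordinates are). The function name and parameters are hypothetical, and the project performs this step in the frontend.

```python
def canvas_to_pdf_rect(x0, y0, x1, y1, scale):
    """Map a drag rectangle in canvas pixels to PDF points.

    Assumes no page rotation and that `scale` is the render scale
    (canvas pixels per PDF point).
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    # The user may drag in any direction, so normalize the corners first.
    left, right = sorted((x0, x1))
    top, bottom = sorted((y0, y1))
    return (left / scale, top / scale, right / scale, bottom / scale)


# A drag from bottom-right to top-left on a page rendered at 1.5x.
print(canvas_to_pdf_rect(300, 150, 150, 75, 1.5))
```

If the canvas is additionally scaled by the device pixel ratio, that factor would need to be folded into `scale` as well.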

4. Redaction Engine

From pdf_editor.py:

  • Extracts words using page.get_text("words")

  • Removes only the words that intersect the selection

  • Uses an overlap threshold (>20%) so that words the selection barely touches are kept
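The word filter described above might look like the sketch below. It assumes the threshold is measured as the fraction of each word's own bounding box covered by the selection, which this README does not specify. The function names are hypothetical, and the word tuples follow the (x0, y0, x1, y1, text, ...) layout returned by page.get_text("words").

```python
def overlap_fraction(word_rect, selection):
    """Fraction of the word's own area covered by the selection rectangle."""
    wx0, wy0, wx1, wy1 = word_rect
    sx0, sy0, sx1, sy1 = selection
    width = min(wx1, sx1) - max(wx0, sx0)
    height = min(wy1, sy1) - max(wy0, sy0)
    if width <= 0 or height <= 0:
        return 0.0  # the rectangles do not intersect
    word_area = (wx1 - wx0) * (wy1 - wy0)
    return (width * height) / word_area if word_area > 0 else 0.0


def words_to_redact(words, selection, threshold=0.2):
    """Keep only the words whose overlap with the selection exceeds the threshold."""
    return [w for w in words if overlap_fraction(w[:4], selection) > threshold]


# Sample data shaped like PyMuPDF word tuples (values are made up).
words = [
    (0, 0, 10, 10, "Alice", 0, 0, 0),
    (8, 0, 18, 10, "Smith", 0, 0, 1),
    (20, 0, 30, 10, "MIT", 0, 0, 2),
]
print([w[4] for w in words_to_redact(words, (0, 0, 10.5, 10))])
```

With PyMuPDF, the rectangles of the selected words could then be passed to page.add_redact_annot() and removed with page.apply_redactions().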

5. Metadata Removal

doc.set_metadata({})
doc.del_xml_metadata()
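For context, the two calls above could sit inside a small save routine like the one below. This is a minimal sketch that assumes the output path differs from the input path. The function name is hypothetical, and the garbage and deflate save options are a suggestion, not something this README confirms the project uses.

```python
def save_without_metadata(src_path, dst_path):
    """Write a copy of `src_path` with its document metadata cleared."""
    import fitz  # PyMuPDF, imported lazily so the module loads without it

    doc = fitz.open(src_path)
    try:
        doc.set_metadata({})    # reset the Info dictionary (title, author, ...)
        doc.del_xml_metadata()  # drop the XMP metadata stream, if present
        # garbage=4 discards unreferenced objects left behind by the edits.
        doc.save(dst_path, garbage=4, deflate=True)
    finally:
        doc.close()
```

Clearing both the Info dictionary and the XMP stream matters because PDF viewers may read author details from either one.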

🧪 Workflow

  1. Select input & output folders
  2. Choose a paper from sidebar
  3. Drag to select author/affiliation area
  4. Click REMOVE (preview)
  5. Click SAVE to finalize
  6. (Optional) Upload acknowledgement and click MERGE & SAVE

⚠️ Known Limitations

  • Word → PDF conversion may alter text positioning
  • Complex layouts (multi-column, tables) may cause slight inaccuracies in word-level selection
  • DOC/DOCX conversion requires Microsoft Word on Windows (not cross-platform)

🔮 Future Improvements

  • 🤖 Auto-detect author sections
  • 📊 Confidence score for anonymization
  • 🧠 NLP-based entity removal (names, emails, institutions)

🛠️ Tech Stack

Frontend:

  • HTML, CSS, JavaScript
  • PDF.js

Backend:

  • FastAPI
  • PyMuPDF (fitz)
  • pywin32 (Word COM)

📌 Notes

  • Output files are saved as:

    <original_name>_anonymized.pdf

  • Temporary files are stored in /temp

  • Supports recursive folder scanning
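The output naming rule can be expressed as a small helper. This is a sketch only: the function name and the flat output layout are assumptions, and the project's own code may build the path differently.

```python
from pathlib import Path


def anonymized_path(input_path, output_dir):
    """Build <original_name>_anonymized.pdf inside `output_dir`.

    Hypothetical helper; assumes every output lands directly in the
    output folder regardless of the input's subfolder.
    """
    return Path(output_dir) / f"{Path(input_path).stem}_anonymized.pdf"


print(anonymized_path("input/sub/paper.docx", "output"))
```

Under this assumption, two inputs with the same file name in different subfolders would map to the same output file, so a real implementation might need to mirror the subfolder structure or disambiguate the names.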


⭐ Summary

This tool provides a semi-automated anonymization pipeline combining:

  • manual precision (user selection)
  • automated processing (word-level redaction + metadata removal)

Designed for research paper review workflows where bias-free evaluation is required.
