Paper Parser 🛠️

Efficient arXiv Search, Download, and AI-Friendly Markdown Parsing.

paper-parser is a CLI tool designed to streamline the academic research workflow. It handles everything from finding a paper on arXiv to converting it into a clean, structured Markdown format that is optimized for LLMs and AI agents.

🚀 Why Use Paper Parser?

Standard PDF-to-text tools often produce one massive block of text, which leads to two major problems when working with AI:

Context Overflow: Large papers can exceed an LLM's context window.
Token Waste: Paying for the entire paper's context when you only need to analyze the "Methodology" or "Conclusion" is expensive and slow.

The Solution: paper-parser uses the MinerU V4 API to extract high-quality Markdown and then automatically splits the paper into chapters. This allows AI agents to read the paper section-by-section, enabling:

✅ Granular Context Management: Only read what matters.
✅ Significant Token Savings: Drastically reduce your API bills.
✅ Higher Accuracy: Focus the model's attention on specific sections.

✨ Key Features

🔍 Intelligent Search: Typos? No problem. Fuzzy-searches arXiv with relevance ranking.
📥 Smart Download: Downloads PDFs into organized, ID-based directories.
🧩 Section Splitting: Automatically splits papers into 01_Introduction.md, 02_Methodology.md, etc.
📦 Incremental Processing: Remembers what you've already downloaded and parsed—no redundant API calls.
🖼️ Image Extraction: Extracts images and maintains correct relative links within the Markdown chapters.
📝 Note Templates: Automatically generates title.md and summary.md for your research notes.

🛠️ Installation

From PyPI (Recommended)

pip install paper-parser-skill==v0.1.3

From Source

# Clone the repository
git clone https://github.com/KaiHangYang/paper-parser-skill.git
cd paper-parser-skill

# Install in editable mode
pip install -e .

⚙️ Configuration

The first time you run pp, it will create a configuration file at ~/.paper-parser/config.yaml.

MINERU_API_TOKEN: "your_token_from_mineru.net"
PAPER_WORKSPACE: "~/paper-parser-workspace"
MINERU_API_TIMEOUT: 600

Important

You need an API token from MinerU to use the parsing features.

📖 Usage Guide

Basic Commands

# Search for a paper by keyword or arXiv ID
pp search "LLaMA 3"
pp search 2303.17564

# Download a paper PDF (cached if already downloaded)
pp download 2303.17564

# Find where a paper is stored locally
pp path 2303.17564

Parsing — Synchronous (Blocking ⚠️)

Warning

pp parse and pp all block until cloud processing completes, which can take several minutes. For agent/automation use, prefer the async workflow below.

# Parse a local PDF or an arXiv paper already downloaded
pp parse 2303.17564
pp parse ./my_local_paper.pdf

# Full workflow in one shot: Search → Download → Parse
pp all 2303.17564

Parsing — Async (Recommended for Agents ✅)

# Step 1: Submit for parsing and return immediately
#  → auto-downloads PDF if needed, uploads to MinerU, returns batch_id
pp submit 2303.17564

# Step 2: Check status later — downloads results automatically when done
pp check 2303.17564
# → "⏳ Still processing" or "✅ Parsing complete!"

Tip

pp submit is idempotent: calling it again on the same paper won't re-upload. It checks the existing task status and returns the current state instead. All commands (parse, all, submit, check) share the same .parse_task.json state file, so you can freely mix sync and async workflows.

📂 Output Structure

PAPER_WORKSPACE/
└── 2303.17564/              # ArXiv ID
    ├── paper.pdf            # Original PDF
    ├── title.md             # Paper metadata
    ├── summary.md           # Note-taking template
    ├── .parse_task.json     # Task state (batch_id, status, timestamps)
    └── markdowns/           # AI-Ready Content
        ├── 01_Introduction.md
        ├── 02_Methods.md
        ├── ...
        └── images/          # Extracted figures & tables

🤝 Acknowledgments

arXiv for the academic paper API.
RapidFuzz for fast fuzzy string matching.
MinerU (mineru.net) for high-quality PDF-to-Markdown parsing.

📜 License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
.github/workflows		.github/workflows
paper_parser		paper_parser
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
SKILL.md		SKILL.md
config-example.yaml		config-example.yaml
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Paper Parser 🛠️

🚀 Why Use Paper Parser?

✨ Key Features

🛠️ Installation

From PyPI (Recommended)

From Source

⚙️ Configuration

📖 Usage Guide

Basic Commands

Parsing — Synchronous (Blocking ⚠️)

Parsing — Async (Recommended for Agents ✅)

📂 Output Structure

🤝 Acknowledgments

📜 License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Paper Parser 🛠️

🚀 Why Use Paper Parser?

✨ Key Features

🛠️ Installation

From PyPI (Recommended)

From Source

⚙️ Configuration

📖 Usage Guide

Basic Commands

Parsing — Synchronous (Blocking ⚠️)

Parsing — Async (Recommended for Agents ✅)

📂 Output Structure

🤝 Acknowledgments

📜 License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages