A simple Python CLI Pipeline for batch processing documents (PDF, DOCX, PPTX, TXT, MD, ODP) into structured summaries using the ultra low-cost Doubleword API and open-weight models. Just load your docs, a prompt, and then the Doubleword API key and you're good to go.
This tool extracts text from multiple document formats and generates comprehensive ~2000-word structured summaries using Doubleword's batch inference API. Originally built for literature reviews in actuarial machine learning research, it can be adapted for any bulk document summarization task.
- Literature reviews - Summarize academic papers systematically
- Regulatory analysis - Convert 200-page consultation papers into actionable digests
- Compliance - Extract structured data from policy documents at scale
- Sentiment analysis - Process customer feedback documents in bulk
- Research synthesis - Analyze collections of technical reports
- LLM/Agent Evaluations - Use LLM as a Judge to evaluate LLM and Agent outputs
Real-world results:
- Initial test: 2 papers processed in ~1 minute
- Production run: 33 papers processed in ~30 minutes
- Total cost: ~15 pence for 35 papers
- SLA: Selected 24-hour window, actual delivery < 30 minutes
- PDF (
.pdf) - Research papers, reports, articles - Microsoft Word (
.docx) - Documents, proposals - Microsoft PowerPoint (
.pptx) - Presentations, slide decks - OpenDocument Presentation (
.odp) - Open format presentations - Plain Text (
.txt) - Text documents - Markdown (
.md) - Technical documentation, notes
All formats are processed through the same pipeline with automatic file type detection.
The pipeline consists of three stages:
Script: create_batch.py
- Scans
data/papers/folder (or custom location via--input-dir) - Extracts text from multiple formats:
- PDF: pypdf (fast) with pdfplumber fallback (robust)
- DOCX: python-docx
- PPTX: python-pptx
- ODP: odfpy
- TXT/MD: Direct text read
- Creates structured JSONL batch requests with custom summarization prompt
- Outputs:
batch_requests_{timestamp}.jsonl
Script: submit_batch.py
- Uploads
batch_requests.jsonlto Doubleword API - Creates batch job with 1-hour completion window
- Saves batch ID to
batch_id.txtfor tracking - Outputs: Batch ID for monitoring
Script: poll_and_process.py
- Polls batch job status at configurable intervals (default: 60 seconds)
- Automatically downloads results when completed
- Calls
process_results.pyto extract and save individual summaries - Outputs: Individual markdown summaries in
data/summaries/
Script: process_results.py
- Downloads batch output file from Doubleword API
- Parses JSONL responses
- Saves each summary as timestamped markdown file
- Format:
{filename}_summary_{timestamp}.md
git clone https://github.com/NnamdiOdozi/batch_summary_doubleword.git
Using uv (recommended):
uv sync
source .venv/bin/activate # Linux/macOS
# OR on Windows: .venv\Scripts\activateOr using pip:
python3 -m venv .venv
source .venv/bin/activate # Linux/macOS
# OR on Windows: .venv\Scripts\activate
pip install -r requirements.txtRequirements (requirements.txt):
pypdf>=6.6.0- Fast PDF text extractionpdfplumber>=0.11.9- Robust fallback for complex PDFspython-docx>=1.1.0- Microsoft Word document extractionpython-pptx>=1.0.0- PowerPoint presentation extractionodfpy>=1.4.1- OpenDocument format extractionopenai>=2.14.0- API client (compatible with Doubleword API)python-dotenv>=1.1.0- Environment variable management
Copy the sample environment file:
cp .env.sample .envEdit .env and fill in your credentials:
# Your Doubleword API token
DOUBLEWORD_AUTH_TOKEN=your_api_token_here
# Doubleword API endpoint
DOUBLEWORD_BASE_URL=https://api.doubleword.ai/v1
# API endpoint for chat completions (relative to base URL)
CHAT_COMPLETIONS_ENDPOINT=/v1/chat/completions
# Model to use
DOUBLEWORD_MODEL=Qwen/Qwen3-VL-235B-A22B-Instruct-FP8
or any other model you would like eg the smaller and cheaper Qwen/Qwen3-VL-30B-A3B-Instruct-FP8
# Polling frequency (seconds)
POLLING_INTERVAL=60
# Batch completion window or SLA (how long the API has to complete the job)
# Options: "1h" or "24h"
COMPLETION_WINDOW=1h
# Summary word count (target length for generated summaries)
SUMMARY_WORD_COUNT=2000
# Maximum tokens for model response (includes reasoning + summary)
MAX_TOKENS=5000Get your API key:
- Visit Doubleword Portal
- Click to join Private Preview
- Create account or log in
- Generate API key in settings
Place documents in:
data/papers/folder
Supported formats: PDF, DOCX, PPTX, ODP, TXT, MD
The pipeline will automatically detect and process all supported files in this directory.
Adjust word count:
Edit SUMMARY_WORD_COUNT in .env to change summary length (default: 2000 words)
Customize prompt template: Edit summarisation_prompt.txt to adjust:
- Output structure and fields
- Technical complexity level
- Markdown formatting
- Required fields
python run_batch_pipeline.pyThis orchestrator script runs all three stages automatically:
- Extracts documents and creates batch requests
- Submits to Doubleword API
- Polls until complete and downloads summaries
Process all files in default directory:
python run_batch_pipeline.pyProcess specific files:
python run_batch_pipeline.py --files paper1.pdf report.docx slides.pptxProcess files from custom directory:
python run_batch_pipeline.py --input-dir /path/to/documents/View all options:
python run_batch_pipeline.py --help
python create_batch.py --helpIf you prefer to run stages individually:
Stage 1: Create batch requests (all files in data/papers/)
python create_batch.pyOr process specific files:
python create_batch.py --files doc1.pdf doc2.docxOr process custom directory:
python create_batch.py --input-dir /custom/path/Output: batch_requests_{timestamp}.jsonl
Stage 2: Submit batch
python submit_batch.pyOutput: batch_id.txt with job ID
Stage 3: Poll and process
python poll_and_process.pyOutput: Individual summaries in data/summaries/
The polling script shows real-time status:
[2026-01-25 14:32:15] Status: in_progress | Progress: 12/35
[2026-01-25 14:32:45] Status: in_progress | Progress: 24/35
[2026-01-25 14:33:15] Status: completed | Progress: 35/35
✓ Batch completed successfully!
Press Ctrl+C to stop polling. Run the script again to resume.
batch_summary_doubleword/
├── README.md # This file
├── pyproject.toml # Python dependencies (uv)
├── requirements.txt # Python dependencies (pip)
├── .env.sample # Environment variable template
├── .gitignore # Git ignore rules
├── run_batch_pipeline.py # Orchestrator script (Python)
├── summarisation_prompt.txt # Prompt template for summaries
├── create_batch.py # Stage 1: PDF extraction
├── submit_batch.py # Stage 2: Batch submission
├── poll_and_process.py # Stage 3: Polling and processing
├── process_results.py # Result processing
└── data/
├── papers/ # Input PDFs
└── summaries/ # Output summaries (auto-created)
Generated files (not in git):
batch_requests_YYYYMMDD_HHMMSS.jsonl- JSONL file with timestamped batch requestsbatch_id_YYYYMMDD_HHMMSS.txt- Timestamped batch job IDdata/summaries/*.md- Individual paper summaries
Adjust how frequently the script checks batch status:
# In .env file
POLLING_INTERVAL=60 # Check every 60 secondsLower values = faster notification, more API calls Higher values = fewer API calls, slower notification
Recommended: 30-60 seconds for most use cases
The default model is Qwen/Qwen3-VL-235B-A22B-Instruct-FP8, which supports:
- Long context windows (128K+ tokens)
- Vision capabilities (for PDFs with charts/diagrams)
- Structured output generation
To use a different model, update DOUBLEWORD_MODEL in .env.
The batch job completion window determines how long the API has to complete your job. Configure via COMPLETION_WINDOW in .env:
COMPLETION_WINDOW=1h # Options: "1h" or "24h"Doubleword typically completes jobs much faster than the window:
- 2 papers: ~1 minute
- 35 papers: ~30 minutes
Use 1h for most cases. Use 24h if you want even cheaper pricing and if task is not as time critical.
Based on actual usage (Jan 2026):
- 35 papers (mixed lengths, 45-200 pages each)
- Model: Qwen3-VL-235B-A22B-Instruct-FP8
- Cost: ~15 pence total (~0.43p per paper)
Cost varies by:
- Document length
- Requested summary length
- Model selected
- Number of requests
Error: Unauthorized
Solution: Check your DOUBLEWORD_AUTH_TOKEN in .env
Solution: Doubleword typically completes in ~1 minute. If waiting longer:
- Check Doubleword portal for job status
- Verify your completion window setting
- Contact Doubleword support if job is stuck
✗ Error processing results
Solution: Check that process_results.py has correct permissions and paths
Use the --input-dir option to process files from any directory:
python run_batch_pipeline.py --input-dir /path/to/your/documents/Or process specific files regardless of location:
python run_batch_pipeline.py --files /path/to/file1.pdf /other/path/file2.docxEdit summarisation_prompt.txt to change:
- Summary structure
- Required fields
- Output length
- Technical depth
Edit process_results.py line 37:
summaries_dir = Path('output/my_summaries') # Custom location- Python 3.12+ - Core runtime
- pypdf - Primary PDF text extraction
- pdfplumber - Fallback extraction for complex PDFs
- python-docx - Microsoft Word document extraction
- python-pptx - PowerPoint presentation extraction
- odfpy - OpenDocument format extraction
- OpenAI SDK - API client (Doubleword API is OpenAI-compatible)
- Doubleword API - Batch inference backend
- Qwen3-VL-235B - Vision-language model for document understanding
Built using:
- Doubleword AI - Batch inference platform
- [Qwen3-VL] - Open-weight vision-language model provided by Doubleword
- OpenAI-compatible API standard for seamless integration
MIT License - see LICENSE file for details
- Try out streaming feature
- Test the model's vision capabilities
- LLM as a Judge - this is often token intensive and async and so a good candidate for batch inference
- Add temperature, top_p, top_k, frequency penalty, presence penalty etc to .env or config file
- Batch inference - Processing multiple requests efficiently
- Open-weight models - Qwen3, DeepSeek, Llama alternatives to proprietary models
- Structured output - JSON/markdown formatted LLM responses
- Document intelligence - AI-powered document analysis at scale