A modern web application that processes PDF files, extracts their content, and presents it in a clean format suitable for use as context in LLM prompts. Built with Flask and modern JavaScript.
- 📄 Upload and process PDF files up to 50MB
- 🔍 Extract text content while maintaining formatting
- 🧹 Clean and format extracted text
- 📋 Copy extracted content to clipboard with one click
- 📱 Responsive web interface
- 🔒 Local processing for privacy
- 📊 Word count statistics
- 🌐 Support for all modern browsers
- ⚡ Real-time upload progress indication
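
The text cleanup and word count features listed above could work roughly as follows. This is a minimal sketch using only the standard library. The names `clean_text` and `word_count` are hypothetical, and the whitespace rules are assumptions for illustration; the actual logic in `app.py` may differ.

```python
import re


def clean_text(raw: str) -> str:
    """Normalize whitespace in extracted PDF text (illustrative rules only)."""
    # Normalize Windows and old-Mac line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop spaces that sit directly before or after a line break.
    text = re.sub(r" ?\n ?", "\n", text)
    # Keep at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())
```

Counting whitespace-separated tokens is the simplest definition of a word; it treats hyphenated terms and numbers as single words.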
- Python 3.6 or higher
- pip (Python package installer)
- libmagic (for file type detection)
  - macOS: brew install libmagic
  - Debian/Ubuntu: sudo apt-get install libmagic1
  - Windows: download and install the latest version of python-magic-bin (for example, with pip install python-magic-bin)
- Clone this repository:
git clone https://github.com/yourusername/pdf-processor.git
cd pdf-processor
- Create a virtual environment:
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Start the server:
python app.py
- Open your browser and navigate to:
http://localhost:8080
- Use the application:
  - Drag and drop a PDF file or click to browse
  - Wait for the upload and processing to complete
  - View the extracted text
  - Use the "Copy to Clipboard" button to copy the content
  - Check word count statistics
pdf-processor/
├── app.py # Main Flask application
├── requirements.txt # Python dependencies
├── static/ # Static files
│ ├── styles.css # CSS styles
│ └── script.js # Frontend JavaScript
├── templates/ # HTML templates
│ └── index.html # Main page template
└── README.md # Project documentation
- Fork the repository
- Create a new branch (`git checkout -b feature/improvement`)
- Make your changes
- Commit your changes (`git commit -am 'Add new feature'`)
- Push to the branch (`git push origin feature/improvement`)
- Create a Pull Request
- All processing is done locally
- No files are stored permanently
- Uploaded files are deleted after processing
- File type validation before processing
- Size limit enforcement
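
The validation and cleanup guarantees above can be sketched as follows. This is an illustration under stated assumptions, not the code in `app.py`: the application uses python-magic for type detection, while this sketch substitutes a simple check of the standard `%PDF-` header so that it runs with the standard library alone. `validate_pdf`, `process_upload`, and the `extract` callable are hypothetical names.

```python
import os
import tempfile

MAX_UPLOAD_BYTES = 50 * 1024 * 1024  # 50MB limit stated in this README
PDF_SIGNATURE = b"%PDF-"             # standard PDF file header


def validate_pdf(data: bytes) -> None:
    """Raise ValueError if the upload is too large or does not look like a PDF."""
    if len(data) > MAX_UPLOAD_BYTES:
        raise ValueError("file exceeds the 50MB limit")
    if not data.startswith(PDF_SIGNATURE):
        raise ValueError("file is not a PDF")


def process_upload(data: bytes, extract) -> str:
    """Validate the upload, write it to a temporary file, and extract its text.

    `extract` is a hypothetical callable that takes a file path and returns text.
    The temporary file is deleted whether or not extraction succeeds.
    """
    validate_pdf(data)
    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        return extract(path)
    finally:
        os.remove(path)  # runs even if extraction raises
```

Putting the deletion in a `finally` block is what keeps a failed extraction from leaving an uploaded file on disk.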
- Chrome (latest)
- Firefox (latest)
- Safari (latest)
- Edge (latest)
- On macOS, AirPlay Receiver may occupy port 5000 (Flask's default), so this application listens on port 8080 instead
- Large PDFs (>50MB) are not supported to ensure stable performance
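
If the server fails to start, it can help to check whether another process already holds the port. The following is a small standard-library sketch, not part of the project; `port_in_use` is a hypothetical helper name.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already accepting TCP connections on the port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return probe.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # 5000 is Flask's default port; 8080 is the port this application uses.
    for port in (5000, 8080):
        state = "in use" if port_in_use(port) else "free"
        print(f"port {port}: {state}")
```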
This project is licensed under the MIT License - see the LICENSE file for details.
- PyPDF2 for PDF processing
- Flask for the web framework
- python-magic for file type detection