PAGEBINDER

A clean, efficient Python tool that crawls a website and converts it into a single PDF document, with additional features available through the CLI options below.

Features

  • Full Website Crawling: Automatically discovers and crawls all pages within a domain
  • Visual Preservation: Generated PDFs closely match the look and layout of the original website
  • Clickable Links: All links in the original website remain functional in the PDF
  • Firefox-Based: Uses Firefox WebDriver for reliable rendering
  • Customizable: Configurable page limits and output options
  • Clean Output: Professional-looking merged PDF documents

Prerequisites

Before using this tool, you need:

  1. Python 3.7+
  2. Firefox Browser - The tool uses Firefox for web rendering

That's it! GeckoDriver is included in the project, so no additional system dependencies are required.

Installation

  1. Clone the repository:
git clone https://github.com/GurenMashu/pagebinder.git
cd pagebinder
  2. Install Python dependencies:
pip install -r requirements.txt
  3. Make geckodriver executable (Linux/Mac only):
chmod +x drivers/geckodriver

Windows users: the geckodriver is ready to use; no additional steps are needed.

You're ready to go! The tool includes everything needed to run.

Usage

Basic Usage

python website_crawler.py https://example.com

Advanced Options

# Specify output filename and maximum pages, and generate an index
python website_crawler.py https://docs.python.org -o python_docs.pdf -m 100 -i

# Run in visible mode (non-headless)
python website_crawler.py https://blog.example.com --no-headless -m 25

# Show help
python website_crawler.py --help

Command Line Options

Option            Description                                                      Default
url               Website URL to crawl (required)                                  -
-o, --output      Output PDF filename                                              website.pdf
-m, --max-pages   Maximum pages to crawl                                           50
--no-headless     Run browser in visible mode                                      headless mode
-i, --index       Generate hierarchical table of contents with clickable links    -
--include         Include only URLs matching this pattern (regex); repeatable     -
--exclude         Exclude URLs matching this pattern (regex); repeatable          -
--max-depth       Maximum URL depth from base URL (e.g., 2 = two levels deep)     -
--resume          Resume from a previous interrupted crawl                        -
--state-file      State file for resume functionality                             crawler_state.json
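
As an illustration of how the --include, --exclude, and --max-depth options could combine, here is a minimal Python sketch; the helper name and internals are hypothetical and not the tool's actual code:

import re
from urllib.parse import urlparse

def url_allowed(url, include, exclude, max_depth, base_url):
    """Hypothetical helper: how --include/--exclude/--max-depth might combine."""
    # --include: when any include patterns are given, at least one must match
    if include and not any(re.search(p, url) for p in include):
        return False
    # --exclude: any matching exclude pattern rejects the URL
    if any(re.search(p, url) for p in exclude):
        return False
    # --max-depth: count path segments beyond the base URL's path
    depth = len([s for s in urlparse(url).path.split("/") if s]) \
        - len([s for s in urlparse(base_url).path.split("/") if s])
    return max_depth is None or depth <= max_depth

# Example: keep /library/ pages, skip /whatsnew/, at most two levels below base
url_allowed("https://docs.python.org/3/library/re.html",
            include=[r"/library/"], exclude=[r"/whatsnew/"],
            max_depth=2, base_url="https://docs.python.org/3/")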

How It Works 🔧

  1. Initialization: Sets up Firefox WebDriver with optimized settings
  2. Crawling: Starting from the base URL, discovers all internal links
  3. PDF Generation: Converts each page to PDF while preserving layout and links
  4. Merging: Combines all individual PDFs into a single document
  5. Cleanup: Removes temporary files and closes browser
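
As a rough outline of these five steps, here is an illustrative sketch built on Selenium 4's print_page and pypdf; it is not the tool's actual source, and the names and limits are examples:

import base64
from urllib.parse import urlparse
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.print_page_options import PrintOptions
from selenium.webdriver.firefox.options import Options
from pypdf import PdfWriter

BASE = "https://example.com"          # hypothetical starting URL
MAX_PAGES = 50

opts = Options()
opts.add_argument("-headless")        # step 1: headless Firefox
driver = webdriver.Firefox(options=opts)

queue, visited, parts = [BASE], set(), []
while queue and len(visited) < MAX_PAGES:
    url = queue.pop(0)
    if url in visited:
        continue
    visited.add(url)
    driver.get(url)
    # step 2: discover links, keeping only same-domain ones
    for a in driver.find_elements(By.TAG_NAME, "a"):
        href = a.get_attribute("href")
        if href and urlparse(href).netloc == urlparse(BASE).netloc:
            queue.append(href)
    # step 3: render the current page to PDF (Selenium returns base64)
    part = f"page_{len(visited):04d}.pdf"
    with open(part, "wb") as f:
        f.write(base64.b64decode(driver.print_page(PrintOptions())))
    parts.append(part)

writer = PdfWriter()                  # step 4: merge into one document
for part in parts:
    writer.append(part)
writer.write("website.pdf")
driver.quit()                         # step 5: cleanup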

Examples

Documentation Site

python website_crawler.py https://requests.readthedocs.io -o requests_docs.pdf -m 75 --max-depth 2

Blog or News Site

python website_crawler.py https://realpython.com/blog/ -o realpython_blog.pdf -m 100 -i 

Small Website (All Pages)

python website_crawler.py https://smallwebsite.com -m 1000

Output

The tool generates:

  • A single PDF file with all crawled pages
  • Preserved visual styling and layout
  • Functional clickable links
  • Progress feedback during crawling
  • File size information upon completion
  • A hierarchical index (when -i is enabled)
  • State files that record how far the crawl has progressed, for resuming (user-enabled; see the sketch below)
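
The exact layout of the state file is not documented here, but the save/resume idea can be sketched as follows (the JSON keys and URLs are hypothetical):

import json

# hypothetical state layout; the real crawler_state.json may differ
visited = ["https://example.com", "https://example.com/about"]
queue = ["https://example.com/contact"]

with open("crawler_state.json", "w") as f:
    json.dump({"visited": visited, "queue": queue}, f, indent=2)

# on --resume, reload the state and pick up where the crawl stopped
with open("crawler_state.json") as f:
    state = json.load(f)
visited, queue = set(state["visited"]), state["queue"]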

Limitations ⚠️

  • Only crawls pages within the same domain
  • Requires a JavaScript-enabled browser (Firefox)
  • Large sites may take considerable time to process
  • PDF size depends on content and number of pages
  • Some dynamic content may not render identically

Contributing 🤝

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Troubleshooting 🔍

Common Issues

"geckodriver not found"

  • Make sure geckodriver is installed and in your PATH
  • Try specifying the full path to geckodriver, as in the sketch below
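
If Selenium cannot locate the driver on its own, pointing it at the binary explicitly usually helps; a minimal sketch (the path shown is an example):

from selenium import webdriver
from selenium.webdriver.firefox.service import Service

# point Selenium at the bundled binary explicitly (path is an example)
service = Service(executable_path="drivers/geckodriver")
driver = webdriver.Firefox(service=service)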

"Connection refused" or timeout errors

  • Check your internet connection
  • Some websites may block automated requests
  • Try slowing the crawl with retries and delays (see the sketch below) or using --no-headless mode
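
There is no documented throttling flag, so slowing down means adapting the code; a hypothetical retry helper with a back-off, purely as a sketch:

import time
from selenium.common.exceptions import WebDriverException

def get_with_retry(driver, url, retries=3, delay=2.0):
    """Hypothetical helper, not part of the tool: retry a page load politely."""
    for attempt in range(retries):
        try:
            driver.get(url)
            return True
        except WebDriverException:
            time.sleep(delay * (attempt + 1))   # back off before the next try
    return False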

Large PDF files

  • Consider reducing the --max-pages limit
  • Some sites have many pages or large images

Memory issues

  • Close other applications while crawling large sites
  • Reduce the maximum pages limit
  • Consider splitting large sites into multiple smaller PDFs
