This Python script is designed to scrape websites and convert their content to Markdown format. It uses Scrapy for web crawling and html2text for HTML-to-Markdown conversion. The application provides both a command-line interface and a user-friendly GUI. It is aimed primarily at documentation sites, so you can feed docs directly to an LLM.
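To make the conversion step concrete, here is a minimal sketch of the HTML-to-Markdown conversion using html2text. It fetches a page with urllib purely for brevity; the project itself crawls with Scrapy, and this snippet is illustrative rather than the script's actual code.

```python
# Minimal sketch of the HTML-to-Markdown step (illustrative, not the project's code).
from urllib.request import urlopen

import html2text

# HTML2Text converts an HTML string into Markdown text.
converter = html2text.HTML2Text()
converter.ignore_links = False  # mirrors the -i/--ignore-links option

html = urlopen("https://example.com").read().decode("utf-8", errors="ignore")
markdown = converter.handle(html)
print(markdown)
```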
- Crawls specified websites (single URL or multiple URLs from a file); see the crawler sketch after this list
- Converts HTML content to Markdown
- Saves each page as a separate Markdown file, maintaining the original site structure
- Creates a table of contents for the scraped content
- Adds the output directory to .gitignore in the Git root directory
- Implements secure coding practices including input validation and sanitization
- Option to combine markdown files per directory during scraping
- Handles 404 errors by attempting to crawl deeper into the directory structure
- Implements rate limiting to prevent overwhelming target websites
- Provides both CLI and GUI interfaces for ease of use
- Verbose inline documentation for improved code readability and maintainability
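For a rough picture of the crawling side, the sketch below shows how a domain-limited Scrapy spider with a per-request delay could look. The class name DocsSpider and the start_url argument are illustrative assumptions, not the project's actual implementation.

```python
# Illustrative Scrapy spider: stays on the start URL's domain, applies a
# request delay, and converts each page to Markdown (not the project's code).
# Run with: scrapy runspider this_file.py -a start_url=https://example.com
from urllib.parse import urlparse

import html2text
import scrapy


class DocsSpider(scrapy.Spider):
    name = "docs"
    custom_settings = {"DOWNLOAD_DELAY": 1.0}  # mirrors the -d/--delay option

    def __init__(self, start_url, **kwargs):
        super().__init__(**kwargs)
        self.start_urls = [start_url]
        # Restrict the crawl to the original domain; Scrapy's OffsiteMiddleware
        # drops requests to other domains based on this list.
        self.allowed_domains = [urlparse(start_url).netloc]
        self.converter = html2text.HTML2Text()

    def parse(self, response):
        # Emit the converted page, then follow links found on it.
        yield {"url": response.url, "markdown": self.converter.handle(response.text)}
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```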
- Python 3.7+
- scrapy
- html2text
- validators
- beautifulsoup4
- typing
- nicegui
Install the required packages using:
pip install -r requirements.txt
To launch the graphical user interface:
python gui.py
The GUI provides an intuitive interface (a minimal wiring sketch follows this list) with:
- Text area for entering URLs (one per line)
- Output directory selection
- Various options including:
- Ignore links toggle
- Verbose output toggle
- Combine markdown files toggle
- Delay between requests
- Custom user agent
- Subdirectory limitation
- Data download limit
- Real-time status updates
- Start/Clear buttons
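The following is a minimal sketch of how such a NiceGUI front end can be wired together; the widget names and the start callback are illustrative and will differ from the project's actual gui.py.

```python
# Illustrative NiceGUI layout for the options listed above (not the real gui.py).
from nicegui import ui

urls = ui.textarea(label="URLs (one per line)")
output_dir = ui.input(label="Output directory", value="scraped_content")
ignore_links = ui.switch("Ignore links")
combine = ui.switch("Combine markdown files")
delay = ui.number(label="Delay between requests (s)", value=1.0)
status = ui.label("Idle")


def start() -> None:
    # In the real application this would launch the scraper; here we only
    # echo the chosen settings as a status update.
    count = len([u for u in urls.value.splitlines() if u.strip()])
    status.set_text(f"Scraping {count} URL(s) into {output_dir.value} ...")


ui.button("Start", on_click=start)
ui.button("Clear", on_click=lambda: urls.set_value(""))

ui.run()
```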
python web_scraper.py -u <url> -o <output_dir> [options]
python web_scraper.py -f <url_file> -o <output_dir> [options]
- -u, --url URL: The URL to scrape
- -f, --file FILE: File containing URLs to scrape (one per line)
- -o, --output OUTPUT_DIR: The directory to save the Markdown files
- -i, --ignore-links: Ignore links in the HTML when converting to Markdown
- -a, --user-agent AGENT: Set a custom user agent string
- -v, --verbose: Increase output verbosity
- -s, --subdir SUBDIR: Limit scraping to a specific subdirectory
- --data-download-limit LIMIT: Limit the amount of data to download (e.g., 4GB)
- --combine-markdown: Combine markdown files per directory during scraping
- -d, --delay DELAY: Set the delay between requests in seconds (default: 1.0)
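For reference, an argparse definition matching the options above could look like the sketch below; the real web_scraper.py may declare them differently.

```python
# Illustrative argument parser mirroring the documented CLI options.
import argparse

parser = argparse.ArgumentParser(description="Scrape a site and save it as Markdown.")
source = parser.add_mutually_exclusive_group(required=True)
source.add_argument("-u", "--url", help="The URL to scrape")
source.add_argument("-f", "--file", help="File containing URLs to scrape (one per line)")
parser.add_argument("-o", "--output", required=True, help="Directory for the Markdown files")
parser.add_argument("-i", "--ignore-links", action="store_true", help="Ignore links during conversion")
parser.add_argument("-a", "--user-agent", help="Custom user agent string")
parser.add_argument("-v", "--verbose", action="store_true", help="Increase output verbosity")
parser.add_argument("-s", "--subdir", help="Limit scraping to a specific subdirectory")
parser.add_argument("--data-download-limit", help="Limit the amount of data to download, e.g. 4GB")
parser.add_argument("--combine-markdown", action="store_true", help="Combine markdown files per directory")
parser.add_argument("-d", "--delay", type=float, default=1.0, help="Delay between requests in seconds")
args = parser.parse_args()
```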
- Using the GUI:
  python gui.py
- Scrape a single URL via CLI:
  python web_scraper.py -u https://example.com -o scraped_content
- Scrape multiple URLs from a file (the expected file format is shown after these examples):
  python web_scraper.py -f urls.txt -o scraped_content
- Scrape with a 100MB data limit and custom user agent:
  python web_scraper.py -u https://example.com -o scraped_content --data-download-limit 100MB -a "MyBot/1.0"
- Scrape and combine markdown files per directory:
  python web_scraper.py -u https://example.com -o scraped_content --combine-markdown
- Scrape with a custom delay between requests:
  python web_scraper.py -u https://example.com -o scraped_content -d 2.0
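The -f/--file option expects a plain text file with one URL per line. For example, a urls.txt could look like this (placeholder URLs):

```
https://example.com/docs
https://example.com/blog
https://example.org/guide
```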
The script creates a directory structure mirroring the scraped website(s) and saves each page as a separate Markdown file. It also generates a table_of_contents.md file in the output directory, providing an overview of all scraped pages. If the --combine-markdown option is used, it will also create a combined.md file in each directory containing all the markdown content for that directory.
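As an illustration, a crawl of https://example.com run with --combine-markdown might produce a layout along these lines (all file names here are hypothetical):

```
scraped_content/
├── table_of_contents.md
└── docs/
    ├── combined.md
    ├── getting-started.md
    └── api/
        ├── combined.md
        └── client.md
```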
- The script implements input validation to ensure only valid URLs are processed.
- It uses the validators library to verify URL integrity (see the sketch after this list).
- The script sanitizes file paths to prevent directory traversal attacks.
- It limits scraping to the specified domain and subdirectory (if provided) to prevent unintended access to other parts of the website.
- Rate limiting is implemented to prevent overwhelming target websites.
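The kind of checks described above could be sketched as follows, assuming the validators library for URL checking and standard-library path handling; the project's actual validation code may differ.

```python
# Rough sketch of URL validation and path sanitization (illustrative only).
import os

import validators


def is_valid_url(url: str) -> bool:
    # validators.url returns True for well-formed URLs and a falsy
    # validation-failure object otherwise.
    return bool(validators.url(url))


def safe_output_path(output_dir: str, relative_path: str) -> str:
    # Resolve the target path and ensure it stays inside output_dir,
    # blocking directory-traversal attempts such as "../../etc/passwd".
    base = os.path.abspath(output_dir)
    target = os.path.abspath(os.path.join(base, relative_path))
    if not target.startswith(base + os.sep):
        raise ValueError(f"Unsafe path outside output directory: {relative_path}")
    return target
```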
- The data download limit feature is in beta and may not work as expected in all scenarios.
- The script does not handle JavaScript-rendered content, as it relies on Scrapy's static HTML parsing.
- The combine-markdown feature works as intended; however, some websites (like https://supabase.com/docs/reference/javascript/installing) use a directory structure in which many paths point to the same document, so the script downloads that document multiple times. This is a known limitation of the script rather than a bug.
- The output may contain some formatting inconsistencies due to the HTML to Markdown conversion process.
- Added graphical user interface using NiceGUI
- Added option to combine markdown files per directory during scraping
- Implemented handling of 404 errors by attempting to crawl deeper into the directory structure
- Added verbose inline documentation to improve code readability and maintainability
- Implemented rate limiting to prevent overwhelming target websites
Contributions to improve the script are welcome. Please ensure that any pull requests maintain or improve the existing security measures and code quality.
This project is licensed under the GNU AGPLv3 License - see the LICENSE file for details.