This Python script is designed to scrape websites and convert their content to Markdown format. It uses Scrapy for web crawling and html2text for HTML-to-Markdown conversion. The application provides both a command-line interface and a user-friendly GUI. It is aimed primarily at documentation sites, so you can feed docs directly to an LLM.
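To make the conversion step concrete, here is a minimal sketch of the HTML-to-Markdown conversion using html2text. It fetches a page with urllib purely for brevity; the project itself crawls with Scrapy, and this snippet is illustrative rather than the script's actual code.

```python
# Minimal sketch of the HTML-to-Markdown step (illustrative, not the project's code).
from urllib.request import urlopen

import html2text

# HTML2Text converts an HTML string into Markdown text.
converter = html2text.HTML2Text()
converter.ignore_links = False  # mirrors the -i/--ignore-links option

html = urlopen("https://example.com").read().decode("utf-8", errors="ignore")
markdown = converter.handle(html)
print(markdown)
```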
- Crawls specified websites (single URL or multiple URLs from a file); see the crawler sketch after this list
- Converts HTML content to Markdown
- Saves each page as a separate Markdown file, maintaining the original site structure
- Creates a table of contents for the scraped content
- Adds the output directory to .gitignore in the Git root directory
- Implements secure coding practices including input validation and sanitization
- Option to combine markdown files per directory during scraping
- Handles 404 errors by attempting to crawl deeper into the directory structure
- Implements rate limiting to prevent overwhelming target websites
- Provides both CLI and GUI interfaces for ease of use
- Verbose inline documentation for improved code readability and maintainability
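For a rough picture of the crawling side, the sketch below shows how a domain-limited Scrapy spider with a per-request delay could look. The class name DocsSpider and the start_url argument are illustrative assumptions, not the project's actual implementation.

```python
# Illustrative Scrapy spider: stays on the start URL's domain, applies a
# request delay, and converts each page to Markdown (not the project's code).
# Run with: scrapy runspider this_file.py -a start_url=https://example.com
from urllib.parse import urlparse

import html2text
import scrapy


class DocsSpider(scrapy.Spider):
    name = "docs"
    custom_settings = {"DOWNLOAD_DELAY": 1.0}  # mirrors the -d/--delay option

    def __init__(self, start_url, **kwargs):
        super().__init__(**kwargs)
        self.start_urls = [start_url]
        # Restrict the crawl to the original domain; Scrapy's OffsiteMiddleware
        # drops requests to other domains based on this list.
        self.allowed_domains = [urlparse(start_url).netloc]
        self.converter = html2text.HTML2Text()

    def parse(self, response):
        # Emit the converted page, then follow links found on it.
        yield {"url": response.url, "markdown": self.converter.handle(response.text)}
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```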
- Python 3.7+
- scrapy
- html2text
- validators
- beautifulsoup4
- typing
- nicegui
Install the required packages using:
pip install -r requirements.txt
To launch the graphical user interface:
python gui.py
The GUI provides an intuitive interface (a minimal wiring sketch follows this list) with:
- Text area for entering URLs (one per line)
- Output directory selection
- Various options including:
- Ignore links toggle
- Verbose output toggle
- Combine markdown files toggle
- Delay between requests
- Custom user agent
- Subdirectory limitation
- Data download limit
- Real-time status updates
- Start/Clear buttons
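The following is a minimal sketch of how such a NiceGUI front end can be wired together; the widget names and the start callback are illustrative and will differ from the project's actual gui.py.

```python
# Illustrative NiceGUI layout for the options listed above (not the real gui.py).
from nicegui import ui

urls = ui.textarea(label="URLs (one per line)")
output_dir = ui.input(label="Output directory", value="scraped_content")
ignore_links = ui.switch("Ignore links")
combine = ui.switch("Combine markdown files")
delay = ui.number(label="Delay between requests (s)", value=1.0)
status = ui.label("Idle")


def start() -> None:
    # In the real application this would launch the scraper; here we only
    # echo the chosen settings as a status update.
    count = len([u for u in urls.value.splitlines() if u.strip()])
    status.set_text(f"Scraping {count} URL(s) into {output_dir.value} ...")


ui.button("Start", on_click=start)
ui.button("Clear", on_click=lambda: urls.set_value(""))

ui.run()
```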
python web_scraper.py -u <url> -o <output_dir> [options]
python web_scraper.py -f <url_file> -o <output_dir> [options]
- -u, --url URL: The URL to scrape
- -f, --file FILE: File containing URLs to scrape (one per line)
- -o, --output OUTPUT_DIR: The directory to save the Markdown files
- -i, --ignore-links: Ignore links in the HTML when converting to Markdown
- -a, --user-agent AGENT: Set a custom user agent string
- -v, --verbose: Increase output verbosity
- -s, --subdir SUBDIR: Limit scraping to a specific subdirectory
- --data-download-limit LIMIT: Limit the amount of data to download (e.g., 4GB)
- --combine-markdown: Combine markdown files per directory during scraping
- -d, --delay DELAY: Set the delay between requests in seconds (default: 1.0)
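For reference, an argparse definition matching the options above could look like the sketch below; the real web_scraper.py may declare them differently.

```python
# Illustrative argument parser mirroring the documented CLI options.
import argparse

parser = argparse.ArgumentParser(description="Scrape a site and save it as Markdown.")
source = parser.add_mutually_exclusive_group(required=True)
source.add_argument("-u", "--url", help="The URL to scrape")
source.add_argument("-f", "--file", help="File containing URLs to scrape (one per line)")
parser.add_argument("-o", "--output", required=True, help="Directory for the Markdown files")
parser.add_argument("-i", "--ignore-links", action="store_true", help="Ignore links during conversion")
parser.add_argument("-a", "--user-agent", help="Custom user agent string")
parser.add_argument("-v", "--verbose", action="store_true", help="Increase output verbosity")
parser.add_argument("-s", "--subdir", help="Limit scraping to a specific subdirectory")
parser.add_argument("--data-download-limit", help="Limit the amount of data to download, e.g. 4GB")
parser.add_argument("--combine-markdown", action="store_true", help="Combine markdown files per directory")
parser.add_argument("-d", "--delay", type=float, default=1.0, help="Delay between requests in seconds")
args = parser.parse_args()
```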
- Using the GUI:
  python gui.py
- Scrape a single URL via CLI:
  python web_scraper.py -u https://example.com -o scraped_content
- Scrape multiple URLs from a file (the expected file format is shown after these examples):
  python web_scraper.py -f urls.txt -o scraped_content
- Scrape with a 100MB data limit and custom user agent:
  python web_scraper.py -u https://example.com -o scraped_content --data-download-limit 100MB -a "MyBot/1.0"
- Scrape and combine markdown files per directory:
  python web_scraper.py -u https://example.com -o scraped_content --combine-markdown
- Scrape with a custom delay between requests:
  python web_scraper.py -u https://example.com -o scraped_content -d 2.0
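The -f/--file option expects a plain text file with one URL per line. For example, a urls.txt could look like this (placeholder URLs):

```
https://example.com/docs
https://example.com/blog
https://example.org/guide
```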
The script creates a directory structure mirroring the scraped website(s) and saves each page as a separate Markdown file. It also generates a table_of_contents.md file in the output directory, providing an overview of all scraped pages. If the --combine-markdown option is used, it will also create a combined.md file in each directory containing all the markdown content for that directory.
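As an illustration, a crawl of https://example.com run with --combine-markdown might produce a layout along these lines (all file names here are hypothetical):

```
scraped_content/
├── table_of_contents.md
└── docs/
    ├── combined.md
    ├── getting-started.md
    └── api/
        ├── combined.md
        └── client.md
```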
- The script implements input validation to ensure only valid URLs are processed.
- It uses the validators library to verify URL integrity (see the sketch after this list).
- The script sanitizes file paths to prevent directory traversal attacks.
- It limits scraping to the specified domain and subdirectory (if provided) to prevent unintended access to other parts of the website.
- Rate limiting is implemented to prevent overwhelming target websites.
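The kind of checks described above could be sketched as follows, assuming the validators library for URL checking and standard-library path handling; the project's actual validation code may differ.

```python
# Rough sketch of URL validation and path sanitization (illustrative only).
import os

import validators


def is_valid_url(url: str) -> bool:
    # validators.url returns True for well-formed URLs and a falsy
    # validation-failure object otherwise.
    return bool(validators.url(url))


def safe_output_path(output_dir: str, relative_path: str) -> str:
    # Resolve the target path and ensure it stays inside output_dir,
    # blocking directory-traversal attempts such as "../../etc/passwd".
    base = os.path.abspath(output_dir)
    target = os.path.abspath(os.path.join(base, relative_path))
    if not target.startswith(base + os.sep):
        raise ValueError(f"Unsafe path outside output directory: {relative_path}")
    return target
```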
- The data download limit feature is in beta and may not work as expected in all scenarios.
- The script does not handle JavaScript-rendered content, as it relies on Scrapy's static HTML parsing.
- The combine-markdown feature works as intended; however, some websites (like https://supabase.com/docs/reference/javascript/installing) use a directory structure in which many paths point to the same document, so the script downloads that document multiple times. This is a known limitation of the script rather than a bug.
- The output may contain some formatting inconsistencies due to the HTML to Markdown conversion process.
- Added graphical user interface using NiceGUI
- Added option to combine markdown files per directory during scraping
- Implemented handling of 404 errors by attempting to crawl deeper into the directory structure
- Added verbose inline documentation to improve code readability and maintainability
- Implemented rate limiting to prevent overwhelming target websites
Contributions to improve the script are welcome. Please ensure that any pull requests maintain or improve the existing security measures and code quality.
This project is licensed under the GNU AGPLv3 License - see the LICENSE file for details.