# UltraCrawlerAndScraper

A powerful, customizable, multithreaded web crawler and scraper built with Scrapy, Tkinter, and EasyGUI. It supports:
- Recursive crawling
- Full web content extraction
- Proxy validation
- User-agent rotation
- Parallel processing
- Custom scraping/crawling modes
- Media downloads (images, videos, audio)
- Custom run time and throttling options
## Features

- ✅ Proxy support with validation
- ✅ Rotating user agents
- ✅ Recursive depth-limited crawling
- ✅ Scrape or full scrape mode
- ✅ Parallel crawler processes
- ✅ AutoThrottle and robots.txt support
- ✅ HTML, metadata, text, and media downloads
- ✅ EasyGUI interface for parameters
- ✅ Terminal suppression for cleaner operation
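
Rotating user agents is typically implemented as a Scrapy downloader middleware. The script itself isn't shown here, so the following is only a sketch of the idea: the class name, `from_file` helper, and file path are illustrative, but the `process_request` hook is the standard Scrapy middleware entry point.

```python
import random


class RotateUserAgentMiddleware:
    """Sketch of a downloader middleware that picks a random User-Agent per request."""

    def __init__(self, user_agents):
        if not user_agents:
            raise ValueError("need at least one user agent")
        self.user_agents = user_agents

    @classmethod
    def from_file(cls, path):
        # User-agent files are plain text, one entry per line (see Notes below).
        with open(path) as f:
            return cls([line.strip() for line in f if line.strip()])

    def process_request(self, request, spider):
        # Scrapy calls this hook for every outgoing request;
        # overwrite the header so each request carries a random agent.
        request.headers["User-Agent"] = random.choice(self.user_agents)
```

Registering such a class under `DOWNLOADER_MIDDLEWARES` in the Scrapy settings is enough to rotate agents across all requests.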
## Requirements

- Python 3.7+
- `scrapy`, `requests`, `bs4`, `easygui`

Install dependencies with:

```bash
pip install -r requirements.txt
```

`requirements.txt`:

```
scrapy
requests
beautifulsoup4
easygui
```
## Installation

```bash
git clone https://github.com/yourusername/UltraCrawlerAndScraper.git
cd UltraCrawlerAndScraper
```

## Usage

Run the script:

```bash
python your_script_name.py
```

Then:

- Select your action (crawl, scrape, full-scrape)
- Input a URL or load a list of URLs
- Select a proxy file and a user-agent file
- Choose depth, delay, number of proxies, and more
- Define whether to respect `robots.txt` and enable `AutoThrottle`
- Set the output folder to store the results
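
The `robots.txt` and `AutoThrottle` toggles correspond to standard Scrapy settings. A representative `settings.py` fragment (the specific values are illustrative, not the tool's defaults):

```python
# settings.py (Scrapy) — options the GUI toggles map to these settings
ROBOTSTXT_OBEY = True         # respect robots.txt rules
AUTOTHROTTLE_ENABLED = True   # adapt request delay to server load
AUTOTHROTTLE_START_DELAY = 1.0
DOWNLOAD_DELAY = 0.5          # base delay between requests
DEPTH_LIMIT = 3               # cap on recursive crawl depth
```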
## Modes

| Mode | Description |
|---|---|
| `Crawl` | Discovers and stores new links recursively |
| `Scrape` | Downloads only the raw HTML of listed URLs |
| `Full-Scrape` | Extracts metadata, headings, paragraphs, links, and media (images/videos/audio) |
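
The core of `Crawl` mode — recursive, depth-limited, duplicate-free link discovery — can be sketched with the standard library. The real tool uses Scrapy and BeautifulSoup; here `fetch` is an injected page loader and the depth cap is illustrative:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute hrefs from <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def crawl(start_url, fetch, max_depth=2):
    """Breadth-first crawl; fetch(url) returns the page HTML.
    A `seen` set avoids revisiting duplicate URLs."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    discovered = [start_url]
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # depth limit reached: record but do not expand
        for link in extract_links(fetch(url), url):
            if link not in seen:
                seen.add(link)
                discovered.append(link)
                queue.append((link, depth + 1))
    return discovered
```

Injecting `fetch` keeps the traversal logic testable without network access; in the actual tool the fetching, throttling, and robots.txt handling are Scrapy's job.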
## Output Structure

```
your_output_folder/
├── discovered_urls.txt
├── scraped_urls.txt
├── example_com.html
├── example_com_metadata.txt
├── example_com_paragraphs.txt
├── example_com_links.txt
├── images/
│   ├── img1.jpeg
│   └── ...
├── videos/
│   └── video1.mp4
└── audio/
    └── sound1.mp3
```
## How It Works

- The user selects the mode and parameters via the GUI.
- The program validates proxies concurrently.
- Multiple crawler instances run in parallel using `multiprocessing`.
- Data is saved to disk: HTML, media files, and extracted content.
- URLs are tracked to avoid duplicates and ensure efficient recursion.
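
The parallel step can be sketched with a `multiprocessing.Pool` that dispatches one worker per start URL. `crawl_one` is a placeholder: in the real tool it would launch a Scrapy `CrawlerProcess` rather than return a string.

```python
import multiprocessing


def crawl_one(url):
    # Placeholder: the real tool would start a Scrapy CrawlerProcess
    # for this URL and write its results to disk.
    return f"crawled:{url}"


def run_parallel(urls, processes=4):
    """Fan the start URLs out across a pool of worker processes."""
    with multiprocessing.Pool(processes=processes) as pool:
        return pool.map(crawl_one, urls)
```

Process-level parallelism sidesteps the fact that each Scrapy `CrawlerProcess` runs its own Twisted reactor, which cannot be restarted within a single process.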
## Notes

- Proxy and user-agent files should be plain text files with one entry per line.
- This tool does not render JavaScript. Consider integrating Selenium for pages that require it.
- Proxy validation uses test sites such as Google, Bing, and HttpBin.
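
Concurrent proxy validation amounts to probing each proxy against a few well-known test sites and keeping the ones that respond. The project lists `requests` as a dependency; this sketch uses the standard library's `urllib` instead so it is self-contained, and the worker count and timeout are illustrative:

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Test sites mirroring the ones the tool probes.
TEST_URLS = [
    "https://httpbin.org/ip",
    "https://www.google.com",
    "https://www.bing.com",
]


def check_proxy(proxy, timeout=5):
    """Return the proxy string if any test site responds through it, else None."""
    handler = urllib.request.ProxyHandler(
        {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    )
    opener = urllib.request.build_opener(handler)
    for url in TEST_URLS:
        try:
            with opener.open(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return proxy
        except (OSError, ValueError):
            continue  # unreachable, refused, or malformed: try the next site
    return None


def validate_proxies(proxies, workers=20):
    """Check many proxies concurrently with a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return [p for p in pool.map(check_proxy, proxies) if p is not None]
```

Threads suit this job because each check is I/O-bound: most of the time is spent waiting on sockets, not the CPU.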
## Disclaimer

This tool is for educational and ethical use only. Always respect website terms of service and `robots.txt` rules.

Developed by Nicholas Taylor.