AI-Powered Web Scraping Tools Collection

A web scraping toolkit that combines traditional scraping methods with AI capabilities. It provides a Streamlit-based UI for working with several scraping tools, including BeautifulSoup, FireCrawl, Jina.ai Reader, and ScrapeGraphAI.

Features

1. Basic Scraper

HTML content extraction using BeautifulSoup4
Structured data extraction including:
- Headings (H1, H2, H3)
- Paragraphs
- Links
JSON output format
Custom CSS selector support
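
The extraction described above can be sketched roughly as follows. This is a minimal illustration rather than the code in basic_scraper.py: the function name `extract_structured`, the helper `clean`, the output keys, and the sample HTML are all assumptions made for the example.

```python
import json
from typing import Optional

from bs4 import BeautifulSoup


def clean(element) -> str:
    """Return the element's text with runs of whitespace collapsed."""
    return " ".join(element.get_text().split())


def extract_structured(html: str, selector: Optional[str] = None) -> dict:
    """Collect headings, paragraphs, and links from an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    data = {
        "headings": {
            tag: [clean(h) for h in soup.find_all(tag)]
            for tag in ("h1", "h2", "h3")
        },
        "paragraphs": [clean(p) for p in soup.find_all("p")],
        "links": [
            {"text": clean(a), "url": a["href"]}
            for a in soup.find_all("a", href=True)
        ],
    }
    if selector:
        # select() takes a CSS selector such as "div.content p"
        data["custom"] = [clean(el) for el in soup.select(selector)]
    return data


# Inline sample so the sketch runs without network access
SAMPLE_HTML = """
<html><body>
  <h1>Title</h1>
  <h2>Section</h2>
  <p class="lead">First paragraph.</p>
  <p>See <a href="https://example.com">the example</a>.</p>
</body></html>
"""

result = extract_structured(SAMPLE_HTML, selector="p.lead")
print(json.dumps(result, indent=2))
```

To scrape a live page, the HTML string would typically come from `requests.get(url).text`.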

2. FireCrawl Scraper

Asynchronous web crawling
Structured data extraction
Progress tracking
Custom parameter support
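
As a rough illustration of asynchronous crawling with progress tracking, the sketch below runs several fetches concurrently and reports progress as each one completes. It does not call the FireCrawl API: `fetch_page` is a hypothetical stand-in that only sleeps, and `crawl_all` and the callback signature are assumptions made for the example.

```python
import asyncio
from typing import Callable, Dict, List


async def fetch_page(url: str) -> Dict[str, str]:
    """Hypothetical stand-in for a real network call; no I/O happens here."""
    await asyncio.sleep(0.01)
    return {"url": url, "status": "done"}


async def crawl_all(
    urls: List[str], on_progress: Callable[[int, int], None]
) -> List[Dict[str, str]]:
    """Run all fetches concurrently and report progress as each one finishes."""
    results: List[Dict[str, str]] = []
    for finished in asyncio.as_completed([fetch_page(u) for u in urls]):
        results.append(await finished)
        on_progress(len(results), len(urls))
    return results


progress_log = []
pages = asyncio.run(
    crawl_all(
        ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
        on_progress=lambda done, total: progress_log.append((done, total)),
    )
)
print(progress_log)
```

In a Streamlit UI, the callback could update a progress bar instead of appending to a list.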

3. Jina.ai Reader

AI-powered content extraction
Text summarization
Image extraction with captions
Multiple output formats
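
A request to the Jina.ai Reader might be built along these lines. The sketch assumes the public Reader endpoint, which takes the target URL appended to `https://r.jina.ai/`, a Bearer token for authentication, and an `Accept: application/json` header for JSON output. The helper name `build_reader_request` is hypothetical, and jina_reader.py in this project may work differently.

```python
import urllib.request
from typing import Optional

# Assumed public Reader endpoint; the target URL is appended to this prefix
JINA_READER_PREFIX = "https://r.jina.ai/"


def build_reader_request(
    target_url: str, api_key: Optional[str] = None
) -> urllib.request.Request:
    """Build (but do not send) a Reader request for target_url."""
    headers = {"Accept": "application/json"}  # ask for JSON instead of plain text
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(JINA_READER_PREFIX + target_url, headers=headers)


request = build_reader_request("https://example.com", api_key="YOUR_JINA_API_KEY")
print(request.full_url)
# To send it: urllib.request.urlopen(request, timeout=30).read()
```

Building the request separately from sending it keeps the example runnable offline and makes the URL and headers easy to inspect.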

4. ScrapeGraphAI

Multiple scraping modes:
- SMART_SCRAPER
- SEARCH
- OMNI_SCRAPER
GPT-3.5 integration
Custom prompt support
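
The three modes could be dispatched to the corresponding scrapegraphai graph classes roughly as shown here. This is a sketch under assumptions, not the code in scrapegraph_ai.py: `build_graph_config` and `run_scrape` are hypothetical helpers, and the model string format (`"openai/gpt-3.5-turbo"` versus `"gpt-3.5-turbo"`) differs between scrapegraphai versions, so check the version you have installed.

```python
from typing import Any, Dict, Optional

VALID_MODES = ("SMART_SCRAPER", "SEARCH", "OMNI_SCRAPER")


def build_graph_config(
    openai_api_key: str, model: str = "openai/gpt-3.5-turbo"
) -> Dict[str, Any]:
    """Build the config dict passed to a graph (model string is an assumption)."""
    return {"llm": {"api_key": openai_api_key, "model": model}}


def run_scrape(
    mode: str, prompt: str, config: Dict[str, Any], source: Optional[str] = None
) -> Any:
    """Run the graph matching mode and return its result."""
    if mode not in VALID_MODES:
        raise ValueError(f"Unknown mode: {mode}")
    if mode != "SEARCH" and source is None:
        raise ValueError(f"{mode} needs a source URL")

    # Imported lazily so this module loads even without scrapegraphai installed
    from scrapegraphai.graphs import OmniScraperGraph, SearchGraph, SmartScraperGraph

    if mode == "SEARCH":
        # SearchGraph finds its own sources, so it takes no source URL
        graph = SearchGraph(prompt=prompt, config=config)
    elif mode == "SMART_SCRAPER":
        graph = SmartScraperGraph(prompt=prompt, source=source, config=config)
    else:
        graph = OmniScraperGraph(prompt=prompt, source=source, config=config)
    return graph.run()


config = build_graph_config("YOUR_OPENAI_API_KEY")
print(config)
```

A call such as `run_scrape("SMART_SCRAPER", "List all article titles", config, source="https://example.com")` needs the scrapegraphai package and a valid OpenAI API key.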

Installation

Clone the repository:

git clone https://github.com/imanoop7/Web-Scrapping-Tool-for-AI
cd Web-Scrapping-Tool-for-AI

Install dependencies:

pip install -r requirements.txt

Usage

Start the Streamlit app:

streamlit run streamlit_app.py

In the web interface:
- Enter the URL to scrape
- Select your preferred scraping method
- Configure any additional parameters
- Click "Start Scraping"
View and download results in JSON format

Required API Keys

Store your API keys securely. The following API keys are required for full functionality:

FireCrawl API key
Jina.ai API key
OpenAI API key (for ScrapeGraphAI)
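
One common way to keep keys out of source code is to read them from environment variables. The sketch below assumes the variable names `FIRECRAWL_API_KEY`, `JINA_API_KEY`, and `OPENAI_API_KEY`; these names and the helper `load_api_keys` are illustrative and may not match what the app actually expects.

```python
import os
from typing import Dict, Mapping

# Hypothetical variable names, chosen for this example
REQUIRED_KEYS = ("FIRECRAWL_API_KEY", "JINA_API_KEY", "OPENAI_API_KEY")


def load_api_keys(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the required keys from env, failing early if any is missing or empty."""
    missing = [name for name in REQUIRED_KEYS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing API keys: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_KEYS}


# Demo with an explicit mapping; omit the argument to read the real environment
demo_keys = load_api_keys(
    {"FIRECRAWL_API_KEY": "fc-demo", "JINA_API_KEY": "jina-demo", "OPENAI_API_KEY": "sk-demo"}
)
print(sorted(demo_keys))
```

Failing early with the names of the missing variables is usually easier to debug than an authentication error from one of the services later on.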

Dependencies

Python 3.8+
streamlit
requests
beautifulsoup4
pydantic
aiohttp
asyncio (included in the Python standard library)
typing-extensions

Project Structure

├── streamlit_app.py      # Main Streamlit application
├── basic_scraper.py      # BeautifulSoup-based scraper
├── firecrawl_scraper.py  # FireCrawl integration
├── jina_reader.py        # Jina.ai Reader integration
├── scrapegraph_ai.py     # ScrapeGraphAI implementation
└── requirements.txt      # Project dependencies

Features by Scraper

Basic Scraper

Content extraction:
- Headings (H1, H2, H3)
- Paragraphs (with length filtering)
- Links with text and URLs
Custom selector support
JSON output with timestamp
Error handling and validation
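
Paragraph length filtering and timestamped JSON output could look something like this. The 20-character threshold, the `scrape_YYYYMMDD_HHMMSS.json` naming pattern, the payload keys, and all function names are assumptions for illustration; basic_scraper.py may use different values.

```python
import json
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, List

MIN_PARAGRAPH_CHARS = 20  # hypothetical threshold


def filter_paragraphs(
    paragraphs: List[str], min_chars: int = MIN_PARAGRAPH_CHARS
) -> List[str]:
    """Strip each paragraph and keep those with at least min_chars characters."""
    return [p.strip() for p in paragraphs if len(p.strip()) >= min_chars]


def build_filename(now: datetime) -> str:
    """Name the output file after the time of the scrape."""
    return f"scrape_{now:%Y%m%d_%H%M%S}.json"


def build_payload(url: str, paragraphs: List[str], now: datetime) -> Dict[str, Any]:
    """Assemble the JSON-serializable result, including a timestamp."""
    return {
        "url": url,
        "timestamp": now.isoformat(timespec="seconds"),
        "paragraphs": filter_paragraphs(paragraphs),
    }


def save_results(url: str, paragraphs: List[str], out_dir: Path) -> Path:
    """Write the payload to a timestamped file in out_dir and return its path."""
    now = datetime.now()
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / build_filename(now)
    path.write_text(json.dumps(build_payload(url, paragraphs, now), indent=2), encoding="utf-8")
    return path
```

Calling `save_results("https://example.com", paragraphs, Path("results"))` writes one file per run, so earlier results are not overwritten unless two runs fall within the same second.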

FireCrawl Scraper

Asynchronous operation
Progress tracking
Structured data extraction
Custom parameter support

Jina Reader

URL content extraction
Text summarization
Image extraction
Error handling

ScrapeGraphAI

Multiple graph types
GPT-3.5 integration
Custom prompts
Image processing support

Best Practices

Rate Limiting
- Implement appropriate delays between requests
- Respect each website's robots.txt
Error Handling
- All scrapers include comprehensive error handling
- Validation of API keys and inputs
Data Management
- Results saved in structured JSON format
- Timestamp-based file naming
- Download functionality for results
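
The rate limiting and robots.txt advice can be combined in a small helper like the one sketched here, using only the standard library. The user agent string, the one-second default delay, and the function names are assumptions; the rules are parsed from an inline string so the example runs offline.

```python
import time
from typing import Iterable, Iterator
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-scraper"  # hypothetical user agent


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(
    urls: Iterable[str], rules: RobotFileParser, delay_seconds: float = 1.0
) -> Iterator[str]:
    """Yield URLs permitted for USER_AGENT, pausing between consecutive yields."""
    first = True
    for url in urls:
        if not rules.can_fetch(USER_AGENT, url):
            continue  # skip anything robots.txt disallows
        if not first:
            time.sleep(delay_seconds)
        first = False
        yield url


rules = load_rules("User-agent: *\nDisallow: /private/\n")
to_fetch = list(
    allowed_urls(
        ["https://example.com/", "https://example.com/private/data"],
        rules,
        delay_seconds=0,  # no pause needed for this offline demo
    )
)
print(to_fetch)
```

Against a live site, `RobotFileParser.set_url()` followed by `read()` downloads the rules instead of parsing a string.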

Contributing

Fork the repository
Create your feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.
