A modern web scraping toolkit that combines traditional scraping methods with AI capabilities. This project provides a Streamlit-based UI for easy interaction with various scraping tools including BeautifulSoup, FireCrawl, Jina.ai Reader, and ScrapeGraphAI.
**BeautifulSoup scraper**
- HTML content extraction using BeautifulSoup4
- Structured data extraction, including:
  - Headings (H1, H2, H3)
  - Paragraphs
  - Links
- JSON output format
- Custom CSS selector support
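As a sketch of how this kind of BeautifulSoup extraction can work (the function name and output shape here are illustrative, not the exact code in `basic_scraper.py`):

```python
from typing import Optional
from bs4 import BeautifulSoup

def extract_structured(html: str, css_selector: Optional[str] = None) -> dict:
    """Extract headings, paragraphs, and links from raw HTML.

    `css_selector` optionally adds the text of matching elements,
    mirroring the custom-selector feature listed above.
    """
    soup = BeautifulSoup(html, "html.parser")
    data = {
        "headings": [h.get_text(strip=True) for h in soup.find_all(["h1", "h2", "h3"])],
        "paragraphs": [p.get_text(strip=True) for p in soup.find_all("p")],
        "links": [{"text": a.get_text(strip=True), "url": a["href"]}
                  for a in soup.find_all("a", href=True)],
    }
    if css_selector:
        data["selected"] = [el.get_text(strip=True) for el in soup.select(css_selector)]
    return data

html = "<h1>Title</h1><p>Hello world.</p><a href='/more'>read on</a>"
result = extract_structured(html, css_selector="h1")
```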
**FireCrawl**
- Asynchronous web crawling
- Structured data extraction
- Progress tracking
- Custom parameter support
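FireCrawl itself is a hosted API, so the exact client calls depend on its SDK; the asynchronous-crawl-with-progress pattern above can be sketched in plain `asyncio` like this (the fetcher is injected, so the sketch runs without network access):

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

async def crawl(urls: List[str],
                fetch: Callable[[str], Awaitable[str]]) -> Dict[str, str]:
    """Fetch all URLs concurrently, printing progress as each completes."""
    results: Dict[str, str] = {}
    done = 0

    async def worker(url: str) -> None:
        nonlocal done
        results[url] = await fetch(url)
        done += 1
        print(f"progress: {done}/{len(urls)}")

    await asyncio.gather(*(worker(u) for u in urls))
    return results

async def fake_fetch(url: str) -> str:
    # Stand-in for a real HTTP client (e.g. aiohttp), so the sketch
    # can run standalone.
    await asyncio.sleep(0)
    return f"<html>{url}</html>"

pages = asyncio.run(crawl(["https://a.example", "https://b.example"], fake_fetch))
```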
**Jina.ai Reader**
- AI-powered content extraction
- Text summarization
- Image extraction with captions
- Multiple output formats
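Jina.ai's Reader is driven by URL prefixing: requesting `https://r.jina.ai/<page-url>` returns an LLM-friendly rendition of the page. A minimal sketch (sending the key as a Bearer token is an assumption about the current API):

```python
from typing import Optional
import requests

READER_ENDPOINT = "https://r.jina.ai/"

def build_reader_url(target_url: str) -> str:
    # The Reader works by URL prefixing: the target URL is appended
    # verbatim to the endpoint.
    return READER_ENDPOINT + target_url

def read_page(target_url: str, api_key: Optional[str] = None,
              timeout: float = 30.0) -> str:
    """Fetch the extracted, LLM-ready text for a page.

    Assumption: a key, when supplied, is sent as a Bearer token.
    """
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    resp = requests.get(build_reader_url(target_url), headers=headers,
                        timeout=timeout)
    resp.raise_for_status()
    return resp.text

reader_url = build_reader_url("https://example.com")
```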
**ScrapeGraphAI**
- Multiple scraping modes:
  - SMART_SCRAPER
  - SEARCH
  - OMNI_SCRAPER
- GPT-3.5 integration
- Custom prompt support
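A hedged sketch of driving the SMART_SCRAPER mode (the class name and config layout follow recent `scrapegraphai` releases; the exact model string and config keys may differ in your installed version):

```python
import os

# LLM configuration consumed by ScrapeGraphAI graph classes; the model
# string and keys below are assumptions about the installed version.
graph_config = {
    "llm": {
        "api_key": os.environ.get("OPENAI_API_KEY", ""),
        "model": "gpt-3.5-turbo",
        "temperature": 0,
    },
    "verbose": False,
}

def run_smart_scraper(prompt: str, source: str) -> dict:
    """SMART_SCRAPER mode: extract what `prompt` asks for from `source`.

    Requires the `scrapegraphai` package and a valid OpenAI key, so the
    import is deferred until the function is actually called.
    """
    from scrapegraphai.graphs import SmartScraperGraph
    graph = SmartScraperGraph(prompt=prompt, source=source, config=graph_config)
    return graph.run()
```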
- Clone the repository:

  ```shell
  git clone https://github.com/imanoop7/Web-Scrapping-Tool-for-AI
  cd Web-Scrapping-Tool-for-AI
  ```
- Install the dependencies:

  ```shell
  pip install -r requirements.txt
  ```
- Start the Streamlit app:

  ```shell
  streamlit run streamlit_app.py
  ```
In the web interface:
- Enter the URL to scrape
- Select your preferred scraping method
- Configure any additional parameters
- Click "Start Scraping"
- View and download the results in JSON format
Store your API keys securely (for example, as environment variables rather than hard-coded strings). The following API keys are required for full functionality:
- FireCrawl API key
- Jina.ai API key
- OpenAI API key (for ScrapeGraphAI)
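One way to keep keys out of source code is to read them from environment variables. A sketch (the variable names below are illustrative, not necessarily the ones this project reads):

```python
import os
from typing import Dict

# Illustrative variable names; adjust to whatever the app actually reads.
REQUIRED_KEYS = ["FIRECRAWL_API_KEY", "JINA_API_KEY", "OPENAI_API_KEY"]

def load_api_keys() -> Dict[str, str]:
    """Read keys from the environment and fail loudly if any is missing."""
    keys = {name: os.environ.get(name, "") for name in REQUIRED_KEYS}
    missing = [name for name, value in keys.items() if not value]
    if missing:
        raise RuntimeError("Missing API keys: " + ", ".join(missing))
    return keys

# Demo values so the sketch runs standalone; real keys come from the shell.
for name in REQUIRED_KEYS:
    os.environ.setdefault(name, "demo-key")
keys = load_api_keys()
```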
- Python 3.8+
- streamlit
- requests
- beautifulsoup4
- pydantic
- aiohttp
- asyncio (part of the standard library in Python 3.8+; no separate install needed)
- typing-extensions
```
├── streamlit_app.py      # Main Streamlit application
├── basic_scraper.py      # BeautifulSoup-based scraper
├── firecrawl_scraper.py  # FireCrawl integration
├── jina_reader.py        # Jina.ai Reader integration
├── scrapegraph_ai.py     # ScrapeGraphAI implementation
└── requirements.txt      # Project dependencies
```
**Basic scraper (`basic_scraper.py`)**
- Content extraction:
  - Headings (H1, H2, H3)
  - Paragraphs (with length filtering)
  - Links with text and URLs
- Custom selector support
- JSON output with timestamp
- Error handling and validation
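The paragraph length filtering mentioned above can be as simple as a character-count threshold (the 40-character minimum here is an illustrative choice, not the project's actual value):

```python
from typing import List
from bs4 import BeautifulSoup

MIN_PARAGRAPH_LENGTH = 40  # characters; illustrative threshold

def filtered_paragraphs(html: str,
                        min_length: int = MIN_PARAGRAPH_LENGTH) -> List[str]:
    """Keep only paragraphs long enough to be real prose, dropping
    stubs such as button labels or one-word captions."""
    soup = BeautifulSoup(html, "html.parser")
    texts = (p.get_text(strip=True) for p in soup.find_all("p"))
    return [t for t in texts if len(t) >= min_length]

html = "<p>Ok</p><p>This paragraph is long enough to pass the filter.</p>"
paras = filtered_paragraphs(html)
```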
**FireCrawl scraper (`firecrawl_scraper.py`)**
- Asynchronous operation
- Progress tracking
- Structured data extraction
- Custom parameter support
**Jina.ai Reader (`jina_reader.py`)**
- URL content extraction
- Text summarization
- Image extraction
- Error handling
**ScrapeGraphAI (`scrapegraph_ai.py`)**
- Multiple graph types
- GPT-3.5 integration
- Custom prompts
- Image processing support
**Rate Limiting**
- Implement appropriate delays between requests
- Respect each website's robots.txt
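Both practices can be sketched with the standard library alone: `urllib.robotparser` for robots.txt checks, and a monotonic-clock throttle for delays (the rules and delay value below are illustrative):

```python
import time
import urllib.robotparser

def allowed_by_robots(robots_lines, url, user_agent="*"):
    """Check a URL against robots.txt rules (passed here as a list of
    lines so the sketch needs no network access)."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_lines)
    return rp.can_fetch(user_agent, url)

ROBOTS = ["User-agent: *", "Disallow: /private/"]
allowed = allowed_by_robots(ROBOTS, "https://example.com/public/page")
blocked = allowed_by_robots(ROBOTS, "https://example.com/private/page")

REQUEST_DELAY = 1.0  # seconds between requests; tune per target site

def throttled(fetch):
    """Wrap a fetch function so consecutive calls are at least
    REQUEST_DELAY seconds apart."""
    last = [float("-inf")]

    def wrapper(url):
        wait = REQUEST_DELAY - (time.monotonic() - last[0])
        if wait > 0:
            time.sleep(wait)
        last[0] = time.monotonic()
        return fetch(url)

    return wrapper

fetch_once = throttled(lambda url: url)  # first call is never delayed
first = fetch_once("https://example.com")
```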
**Error Handling**
- All scrapers include comprehensive error handling
- Validation of API keys and inputs
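Input validation before any network call can be sketched like this (the helper names are illustrative, not the project's actual functions):

```python
from urllib.parse import urlparse

def validate_url(url: str) -> str:
    """Reject anything that is not an absolute http(s) URL before a
    scraper touches the network."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Invalid URL: {url!r}")
    return url

def validate_api_key(key: str, name: str = "API key") -> str:
    """Catch empty or whitespace-only keys early, before an API call
    fails with a confusing remote error."""
    if not key or not key.strip():
        raise ValueError(f"{name} is empty")
    return key

ok = validate_url("https://example.com/page")
try:
    validate_url("not-a-url")
    rejected = False
except ValueError:
    rejected = True
```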
**Data Management**
- Results saved in structured JSON format
- Timestamp-based file naming
- Download functionality for results
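Timestamp-based JSON saving can be sketched as follows (the filename pattern is an illustrative choice, not necessarily the one the app uses):

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path

def save_results(results: dict, out_dir: str = ".",
                 prefix: str = "scrape") -> Path:
    """Write results as pretty-printed JSON to
    <out_dir>/<prefix>_<YYYYmmdd_HHMMSS>.json and return the path."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = Path(out_dir) / f"{prefix}_{stamp}.json"
    path.write_text(json.dumps(results, indent=2, ensure_ascii=False),
                    encoding="utf-8")
    return path

# Demo in a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_results({"headings": ["Title"]}, out_dir=tmp)
    filename = saved.name
    contents = json.loads(saved.read_text(encoding="utf-8"))
```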
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.