This project is a web crawler built with Python that extracts wedding reception venue data from a website using asynchronous programming with Crawl4AI. It uses an LLM-based extraction strategy and saves the collected data to a CSV file.
- Asynchronous web crawling using Crawl4AI
- Data extraction powered by a language model (LLM)
- CSV export of extracted venue information
- Modular and easy-to-follow code structure ideal for beginners
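Concretely, the heart of the crawler presumably looks something like the sketch below. This is a minimal illustration, not the project's actual code: the provider string, URL, and instruction are placeholder assumptions, and newer Crawl4AI releases pass LLM settings through `LLMConfig`/`CrawlerRunConfig` objects rather than the keyword arguments shown here.

```python
import asyncio
import os

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy


async def main() -> None:
    # LLM-driven extraction; provider/model and instruction are placeholders.
    strategy = LLMExtractionStrategy(
        provider="groq/deepseek-r1-distill-llama-70b",  # assumption, not confirmed
        api_token=os.getenv("GROQ_API_KEY"),
        extraction_type="schema",
        instruction="Extract each venue's name, location, price, and capacity.",
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/venues",  # placeholder for BASE_URL
            extraction_strategy=strategy,
        )
        print(result.extracted_content)


if __name__ == "__main__":
    asyncio.run(main())
```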
The project is laid out as follows:

```
.
├── main.py              # Main entry point for the crawler
├── config.py            # Configuration constants (base URL, CSS selectors, etc.)
├── models/
│   └── venue.py         # Defines the Venue data model using Pydantic
├── utils/
│   ├── __init__.py      # (Empty) package marker for utils
│   ├── data_utils.py    # Utility functions for processing and saving data
│   └── scraper_utils.py # Utility functions for configuring and running the crawler
├── requirements.txt     # Python package dependencies
├── .gitignore           # Git ignore file (e.g., excludes .env and CSV files)
└── README.MD            # This file
```
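For orientation, the `Venue` model in `models/venue.py` presumably resembles the sketch below; the field names here are assumptions and should mirror `REQUIRED_KEYS` in `config.py`.

```python
from pydantic import BaseModel


class Venue(BaseModel):
    """Illustrative venue schema; the real fields live in models/venue.py."""

    name: str
    location: str
    price: str
    capacity: str
    description: str
```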
To set up the project:

- **Create and activate a conda environment**

  ```bash
  conda create -n deep-seek-crawler python=3.12 -y
  conda activate deep-seek-crawler
  ```

- **Install dependencies**

  ```bash
  pip install -r requirements.txt
  ```

- **Set up your environment variables**

  Create a `.env` file in the root directory with content similar to:

  ```
  GROQ_API_KEY=your_groq_api_key_here
  ```

  (Note: the `.env` file is listed in `.gitignore`, so it won't be pushed to version control.)
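Assuming the project reads this key with the `python-dotenv` package (a common pattern; check `requirements.txt`), loading it would look like:

```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads key/value pairs from .env into the environment

groq_api_key = os.getenv("GROQ_API_KEY")
if not groq_api_key:
    raise RuntimeError("GROQ_API_KEY is not set; add it to your .env file.")
```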
To start the crawler, run:

```bash
python main.py
```

The script crawls the specified website, extracts data page by page, and saves the completed venue records to a `complete_venues.csv` file in the project directory. Usage statistics for the LLM strategy are displayed after crawling finishes.
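The CSV export in `utils/data_utils.py` likely reduces to something like the following sketch; `save_venues_to_csv` is a hypothetical name used here for illustration.

```python
import csv


def save_venues_to_csv(venues: list[dict], filename: str = "complete_venues.csv") -> None:
    """Write venue records to a CSV file (hypothetical helper)."""
    if not venues:
        print("No venues to save.")
        return
    fieldnames = list(venues[0].keys())  # assumes all records share one schema
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(venues)
```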
The `config.py` file contains key constants used throughout the project:

- `BASE_URL`: The URL of the website from which venue data is extracted.
- `CSS_SELECTOR`: CSS selector string used to target venue content.
- `REQUIRED_KEYS`: List of fields a venue record must contain before it is considered complete.

You can modify these values as needed; a sketch of the file appears below.
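Every value in this sketch is a placeholder, not the project's real configuration:

```python
# config.py -- illustrative placeholder values only
BASE_URL = "https://example.com/wedding-venues"
CSS_SELECTOR = ".venue-card"
REQUIRED_KEYS = ["name", "location", "price", "capacity"]
```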
- Logging: The project currently uses print statements for status messages. For production use or further development, consider switching to Python's built-in `logging` module; a minimal setup is sketched after this list.
- Improvements: The code is split across multiple modules to maintain separation of concerns, making it easier for beginners to follow and extend the functionality.
- Dependencies: Ensure that the package versions pinned in `requirements.txt` are installed to avoid compatibility issues.
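A minimal `logging` setup could look like this (a standard-library sketch, not code from the project):

```python
import logging

# Configure root logging once, e.g. at the top of main.py.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger(__name__)

logger.info("Saved %d venues to %s", 42, "complete_venues.csv")  # instead of print()
```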
License information can be added here if applicable.