Deep Seek Crawler

This project is a Python web crawler that extracts venue data (wedding reception venues) from a website using asynchronous programming with Crawl4AI. It uses an LLM-based extraction strategy and saves the collected data to a CSV file.

Features

  • Asynchronous web crawling using Crawl4AI
  • Data extraction powered by a language model (LLM)
  • CSV export of extracted venue information
  • Modular and easy-to-follow code structure ideal for beginners
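The asynchronous, page-by-page crawl loop can be sketched with plain asyncio. Note this is a simplified stand-in: `fetch_page` here returns fake data, whereas the real project delegates fetching and extraction to Crawl4AI's `AsyncWebCrawler`.

```python
import asyncio

async def fetch_page(page: int) -> list[dict]:
    # Stand-in for a Crawl4AI call; returns fake venue dicts.
    await asyncio.sleep(0)  # simulate awaiting network I/O
    if page > 2:
        return []  # an empty page signals the end of the listing
    return [{"name": f"Venue {page}-{i}"} for i in range(2)]

async def crawl_all() -> list[dict]:
    venues, page = [], 1
    while True:
        batch = await fetch_page(page)
        if not batch:
            break  # stop when a page yields no venues
        venues.extend(batch)
        page += 1
    return venues

venues = asyncio.run(crawl_all())
print(len(venues))  # 4 venues collected across 2 pages
```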

Project Structure

.
├── main.py              # Main entry point for the crawler
├── config.py            # Configuration constants (base URL, CSS selectors, etc.)
├── models/
│   └── venue.py         # Defines the Venue data model using Pydantic
├── utils/
│   ├── __init__.py      # (Empty) package marker for utils
│   ├── data_utils.py    # Utility functions for processing and saving data
│   └── scraper_utils.py # Utility functions for configuring and running the crawler
├── requirements.txt     # Python package dependencies
├── .gitignore           # Git ignore file (e.g., excludes .env and CSV files)
└── README.MD            # This file
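The `Venue` model in `models/venue.py` is a Pydantic model; the shape is roughly the following, shown here with a standard-library dataclass so the sketch runs without extra dependencies. The field names are illustrative, not the project's exact schema.

```python
from dataclasses import dataclass, asdict

@dataclass
class Venue:
    # Illustrative fields; the real Pydantic model lives in models/venue.py.
    name: str
    location: str
    price: str
    capacity: str

v = Venue("Rose Hall", "Austin, TX", "$$", "250")
print(asdict(v))  # plain dict, ready for CSV export
```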

Installation

  1. Create and Activate a Conda Environment

    conda create -n deep-seek-crawler python=3.12 -y
    conda activate deep-seek-crawler
  2. Install Dependencies

    pip install -r requirements.txt
  3. Set Up Your Environment Variables

    Create a .env file in the root directory with content similar to:

    GROQ_API_KEY=your_groq_api_key_here

    (Note: The .env file is in your .gitignore, so it won’t be pushed to version control.)
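At runtime the key is read from the environment; a common pattern is to load the `.env` file (e.g., with python-dotenv's `load_dotenv()`) and then fail fast if the key is missing. The helper below is a sketch, not the project's exact code.

```python
import os

def require_api_key(name: str = "GROQ_API_KEY") -> str:
    # Assumes the variable was already loaded from .env,
    # e.g. via python-dotenv's load_dotenv(), or exported in the shell.
    key = os.getenv(name)
    if not key:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return key
```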

Usage

To start the crawler, run:

python main.py

The script will crawl the specified website, extract data page by page, and save the complete venues to a complete_venues.csv file in the project directory. Additionally, usage statistics for the LLM strategy will be displayed after crawling.
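The filter-and-export step described above might look like the following stdlib sketch. The function names and the `REQUIRED_KEYS` values are assumptions for illustration; the project's actual helpers live in `utils/data_utils.py`.

```python
import csv

REQUIRED_KEYS = ["name", "location", "price"]  # illustrative

def is_complete(venue: dict) -> bool:
    # A venue counts as complete only if every required field is non-empty.
    return all(venue.get(k) for k in REQUIRED_KEYS)

def save_venues(venues: list[dict], path: str = "complete_venues.csv") -> int:
    complete = [v for v in venues if is_complete(v)]
    if complete:
        with open(path, "w", newline="", encoding="utf-8") as f:
            # extrasaction="ignore" drops any extra keys a venue dict may carry
            writer = csv.DictWriter(f, fieldnames=REQUIRED_KEYS, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(complete)
    return len(complete)
```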

Configuration

The config.py file contains key constants used throughout the project:

  • BASE_URL: The URL of the website from which to extract venue data.
  • CSS_SELECTOR: CSS selector string used to target venue content.
  • REQUIRED_KEYS: List of required fields to consider a venue complete.

You can modify these values as needed.
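Put together, `config.py` is a small module of constants along these lines (the values below are placeholders, not the project's real settings):

```python
# config.py -- placeholder values for illustration only.
BASE_URL = "https://example.com/wedding-venues"      # site to crawl
CSS_SELECTOR = "div.venue-card"                       # targets each venue block
REQUIRED_KEYS = ["name", "location", "price", "capacity"]  # completeness check
```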

Additional Notes

  • Logging: The project currently uses print statements for status messages. For production or further development, consider integrating Python’s built-in logging module.
  • Improvements: The code is structured in multiple modules to maintain separation of concerns, making it easier for beginners to follow and extend the functionality.
  • Dependencies: Ensure that the package versions specified in requirements.txt are installed to avoid compatibility issues.

License

Include license information if applicable.
