Skip to content

chronometer/web-content-agent

Repository files navigation

Web Content Agent πŸ€–

An AI-powered agent that automatically discovers, downloads, and organizes web content based on your criteria. Built with browser-use and LangChain.

License: MIT Python Version

🌟 Features

  • Flexible Content Discovery: Configure the agent to search any website for any type of content
  • Smart Filtering: Uses GPT-4 to evaluate and select content based on your criteria
  • Template System: Pre-built templates for common use cases (research papers, news articles, etc.)
  • Visual History: Creates GIF recordings of browsing sessions for transparency
  • Customizable Output: Organize and format downloaded content your way

πŸš€ Quick Start

Prerequisites

  • Python 3.11 or higher
  • OpenAI API key
  • Git (for cloning)

Installation

  1. Clone the repository:
git clone https://github.com/chronometer/web-content-agent.git
cd web-content-agent
  1. Create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  1. Install dependencies:
pip install -r requirements.txt
playwright install
  1. Configure environment:
cp .env.example .env
# Edit .env and add your OpenAI API key

Usage

  1. Using templates:
python src/content_agent.py --template research --config config/config.yaml
  1. Custom task:
python src/content_agent.py --task "Find articles about AI" --urls "news.com,blog.com"

🎯 Configuration

Main Configuration (config/config.yaml)

agent:
  name: "Web Content Agent"
  model: "gpt-4o"
  timeout: 3600

output:
  directory: "downloads"
  naming_pattern: "{date}_{title}"

Task Templates

Templates are YAML files in config/templates/ that define specific search patterns:

  1. Research Papers (research.yaml):
task: """
Search for research papers about:
{topics}
From: {urls}
"""
parameters:
  topics:
    - "machine learning"
    - "AI"
  urls:
    - arxiv.org
    - scholar.google.com
  1. News Articles (news.yaml):
task: """
Find news articles about:
{topics}
Published between:
{date_from} and {date_to}
"""
parameters:
  date_from: "2024-01-01"
  date_to: "2024-12-31"

πŸ“ Output

The agent produces:

  • Downloaded content in downloads/
  • Summary reports with metadata
  • Visual browsing history (agent_history.gif)

πŸ› οΈ Creating Custom Templates

  1. Create a new YAML file in config/templates/:
name: "Custom Template"
description: "Your template description"
task: """
Your task description with
{parameter_placeholders}
"""
parameters:
  your_parameter:
    type: string
    description: "Parameter description"
    default: "Default value"
  1. Use the template:
python src/content_agent.py --template your_template

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • browser-use for the browser automation framework
  • LangChain for the LLM integration
  • All contributors and users of this project

⚠️ Disclaimer

This tool is for research purposes only. Please respect websites' terms of service and robots.txt when using this tool. Some websites may require authentication or have specific terms for automated access.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages