🕸️ Agent Dash: Instagram Multi-Crawler Data System

A production-grade Instagram data crawling platform that runs multiple concurrent crawlers with proxy-based IP rotation to collect profiles, posts, reels, and comments at scale. It outputs validated, deduplicated JSON for data sharing.

Features

  • Multi-Crawler Pool: configurable concurrent workers (1–10)
  • Proxy IP Rotation: round-robin assignment, health tracking, auto-failover
  • Anti-Detection: randomized delays, User-Agent rotation, session cycling
  • Pydantic Validation: all data strictly typed and validated
  • JSON Data Store: organized per-user directories with deduplication
  • Batch Jobs: define targets in YAML/JSON, crawl them all in parallel
  • Rich CLI: terminal output with tables and status panels
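
To make the proxy rotation feature concrete, here is a minimal sketch of round-robin assignment with failure counting and failover. The `ProxyPool` class, its method names, and the `max_failures` threshold are hypothetical illustrations, not the project's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProxyPool:
    """Hypothetical round-robin pool; names and threshold are assumptions."""

    proxies: list[str]
    max_failures: int = 3  # assumed cutoff before a proxy is skipped
    _failures: dict[str, int] = field(default_factory=dict)
    _cursor: int = 0

    def next_proxy(self) -> Optional[str]:
        """Return the next healthy proxy, or None if none is usable."""
        for _ in range(len(self.proxies)):
            proxy = self.proxies[self._cursor % len(self.proxies)]
            self._cursor += 1
            if self._failures.get(proxy, 0) < self.max_failures:
                return proxy
        return None  # empty list, or every proxy exceeded the threshold

    def report_failure(self, proxy: str) -> None:
        """Count one failed request against a proxy."""
        self._failures[proxy] = self._failures.get(proxy, 0) + 1

    def report_success(self, proxy: str) -> None:
        """Reset the failure count after a successful request."""
        self._failures[proxy] = 0
```

A caller could treat a `None` result as a signal to fall back to direct mode or to pause the worker; which behavior fits depends on how the real crawler handles exhausted proxies.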

Quick Start

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Crawl a single account
python main.py crawl virat.kohli --limit 5 --delay 5

# Batch crawl from jobs file
python main.py batch --jobs jobs.yaml --workers 3

# Check data status
python main.py status

# Export latest data for a target
python main.py export virat.kohli

# Check proxy health
python main.py proxies --proxies proxies.txt

Commands

| Command | Description |
|---------|-------------|
| crawl | Crawl a single Instagram account |
| batch | Batch crawl multiple accounts from a job file |
| status | Show status of all crawled data |
| export | Print latest JSON data for a target |
| proxies | Check proxy list health and availability |

Proxy Setup

Add your proxies to proxies.txt (one per line):

http://proxy1.example.com:8080
socks5://user:password@proxy2.example.com:1080
https://proxy3.example.com:3128

If no proxy file is provided, the system runs in direct mode (your own IP).
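
As an illustration of the file format and the direct-mode fallback, the sketch below parses a proxy list and returns an empty list when no file is available. The function names are hypothetical, and skipping blank lines and `#` comments is an assumption; the accepted schemes are only the three shown in the example above.

```python
from pathlib import Path
from typing import Optional
from urllib.parse import urlsplit

# Schemes taken from the example entries; the real tool may accept others.
ALLOWED_SCHEMES = {"http", "https", "socks5"}


def parse_proxy_lines(lines: list[str]) -> list[str]:
    """Keep lines that look like scheme://[user:pass@]host:port."""
    proxies = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):  # assumed comment syntax
            continue
        parts = urlsplit(line)
        try:
            has_endpoint = bool(parts.hostname and parts.port)
        except ValueError:  # raised by urlsplit for a non-numeric port
            has_endpoint = False
        if parts.scheme in ALLOWED_SCHEMES and has_endpoint:
            proxies.append(line)
    return proxies


def load_proxies(path: Optional[str]) -> list[str]:
    """Return parsed proxies; an empty list stands for direct mode."""
    if path is None or not Path(path).is_file():
        return []
    text = Path(path).read_text(encoding="utf-8")
    return parse_proxy_lines(text.splitlines())
```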

Job File Format (jobs.yaml)

targets:
  - username: virat.kohli
    data_types: [profile, posts, reels]
    posts_limit: 20

  - username: cristiano
    data_types: [profile, posts]
    posts_limit: 50
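
The job structure above can be checked before any crawl starts. The sketch below validates an already-parsed document (for example the result of `yaml.safe_load` or `json.load`) using plain dataclasses as a stand-in for the project's Pydantic models. The `Target` class, `parse_jobs`, and the set of accepted data types are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Assumed from the project description: profiles, posts, reels, comments.
KNOWN_DATA_TYPES = {"profile", "posts", "reels", "comments"}


@dataclass(frozen=True)
class Target:
    """Hypothetical model of one entry under `targets`."""

    username: str
    data_types: tuple[str, ...]
    posts_limit: Optional[int] = None


def parse_jobs(doc: dict[str, Any]) -> list[Target]:
    """Validate a parsed job document and drop duplicate usernames."""
    targets: list[Target] = []
    seen: set[str] = set()
    for entry in doc.get("targets", []):
        username = str(entry["username"]).strip()
        if not username:
            raise ValueError("username must not be empty")
        if username in seen:  # keep the first definition only
            continue
        data_types = tuple(entry["data_types"])
        unknown = set(data_types) - KNOWN_DATA_TYPES
        if unknown:
            raise ValueError(f"unknown data types for {username}: {sorted(unknown)}")
        limit = entry.get("posts_limit")
        if limit is not None and (not isinstance(limit, int) or limit < 1):
            raise ValueError(f"posts_limit for {username} must be a positive integer")
        seen.add(username)
        targets.append(Target(username, data_types, limit))
    return targets
```

Failing fast on an unknown data type or a non-positive limit avoids spending proxy bandwidth on a job that would be rejected later.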

Output Structure

output/
├── index.json                 # Master index of all crawls
├── virat.kohli/
│   ├── profile_20260219.json
│   ├── posts_20260219.json
│   ├── reels_20260219.json
│   └── latest.json            # Full combined result
└── cristiano/
    └── ...
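
One way such a layout could be produced is sketched below: each data type is written to a dated file in the user's directory and merged into `latest.json`. The helper names, the `id` field used for deduplication, and the shape of `latest.json` (one key per data type) are assumptions; updating `index.json` is left out.

```python
import json
from datetime import date
from pathlib import Path
from typing import Any, Optional


def dedupe(items: list[dict[str, Any]], key: str = "id") -> list[dict[str, Any]]:
    """Drop repeated items, keeping the first one seen for each key value."""
    seen = set()
    unique = []
    for item in items:
        if item[key] not in seen:
            seen.add(item[key])
            unique.append(item)
    return unique


def save_result(root: Path, username: str, data_type: str,
                payload: Any, day: Optional[date] = None) -> Path:
    """Write <root>/<username>/<data_type>_<YYYYMMDD>.json and update latest.json."""
    day = day or date.today()
    user_dir = root / username
    user_dir.mkdir(parents=True, exist_ok=True)

    dated_path = user_dir / f"{data_type}_{day:%Y%m%d}.json"
    dated_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")

    # Assumed format: latest.json maps each data type to its newest payload.
    latest_path = user_dir / "latest.json"
    latest = {}
    if latest_path.exists():
        latest = json.loads(latest_path.read_text(encoding="utf-8"))
    latest[data_type] = payload
    latest_path.write_text(json.dumps(latest, indent=2), encoding="utf-8")
    return dated_path
```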

Configuration

All settings are configurable via environment variables (prefix AGENTDASH_):

| Variable | Default | Description |
|----------|---------|-------------|
| AGENTDASH_MAX_WORKERS | 3 | Concurrent crawler workers |
| AGENTDASH_DELAY_MIN | 3.0 | Min delay between requests (s) |
| AGENTDASH_DELAY_MAX | 8.0 | Max delay between requests (s) |
| AGENTDASH_PROXY_FILE | proxies.txt | Proxy list file |
| AGENTDASH_MAX_RETRIES | 3 | Retries per failed request |
| AGENTDASH_SESSION_ROTATE_AFTER | 15 | Rotate session after N requests |
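
The sketch below shows how these variables might be read, with the documented defaults applied when a variable is unset. The `Settings` class, `load_settings`, and `next_delay` are hypothetical names using only the standard library; the project itself may load its configuration differently. The 1–10 worker range comes from the feature list.

```python
import os
import random
from dataclasses import dataclass, fields
from typing import Mapping

PREFIX = "AGENTDASH_"


@dataclass(frozen=True)
class Settings:
    """Defaults mirror the documented configuration values."""

    max_workers: int = 3
    delay_min: float = 3.0
    delay_max: float = 8.0
    proxy_file: str = "proxies.txt"
    max_retries: int = 3
    session_rotate_after: int = 15


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from AGENTDASH_* variables, falling back to defaults."""
    values = {}
    for f in fields(Settings):
        raw = env.get(PREFIX + f.name.upper())
        # Convert the string to the type of the default (int, float, or str).
        values[f.name] = type(f.default)(raw) if raw is not None else f.default
    settings = Settings(**values)
    if not 1 <= settings.max_workers <= 10:
        raise ValueError("AGENTDASH_MAX_WORKERS must be between 1 and 10")
    if settings.delay_min > settings.delay_max:
        raise ValueError("AGENTDASH_DELAY_MIN must not exceed AGENTDASH_DELAY_MAX")
    return settings


def next_delay(settings: Settings) -> float:
    """Pick a random pause, in seconds, between the configured bounds."""
    return random.uniform(settings.delay_min, settings.delay_max)
```

For example, `AGENTDASH_MAX_WORKERS=5 python main.py batch --jobs jobs.yaml` would, under these assumptions, yield `load_settings().max_workers == 5`.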

⚠️ Disclaimer

This tool accesses publicly available data only. Use responsibly and respect Instagram's rate limits and Terms of Service.