Amanlook/dataCraw

🕸️ Agent Dash — Instagram Multi-Crawler Data System

A production-grade Instagram data crawling platform that runs multiple concurrent crawlers with proxy-based IP rotation to collect profiles, posts, reels, and comments at scale. It outputs validated, deduplicated JSON suitable for data sharing.

Features

  • Multi-Crawler Pool — configurable concurrent workers (1–10)
  • Proxy IP Rotation — round-robin assignment, health tracking, auto-failover
  • Anti-Detection — randomized delays, User-Agent rotation, session cycling
  • Pydantic Validation — all data strictly typed and validated
  • JSON Data Store — organized per-user directories with deduplication
  • Batch Jobs — define targets in YAML/JSON, crawl them all in parallel
  • Rich CLI — beautiful terminal output with tables and status panels
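
The proxy rotation feature listed above (round-robin assignment, health tracking, auto-failover) can be illustrated with a minimal sketch. This is not the project's actual code: the `ProxyPool` class, its method names, and the `max_failures` threshold are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProxyPool:
    """Hypothetical round-robin pool; not the project's actual class."""

    proxies: list[str]
    max_failures: int = 3  # assumed threshold before a proxy is skipped
    _failures: dict[str, int] = field(default_factory=dict)
    _cursor: int = 0

    def next_proxy(self) -> Optional[str]:
        """Return the next healthy proxy in order, or None if none is usable."""
        for _ in range(len(self.proxies)):
            proxy = self.proxies[self._cursor % len(self.proxies)]
            self._cursor += 1
            if self._failures.get(proxy, 0) < self.max_failures:
                return proxy
        return None  # caller may fall back to direct mode

    def report_failure(self, proxy: str) -> None:
        """Count one failed request against a proxy."""
        self._failures[proxy] = self._failures.get(proxy, 0) + 1

    def report_success(self, proxy: str) -> None:
        """Reset the failure count after a successful request."""
        self._failures[proxy] = 0
```

A worker would call `next_proxy()` before each request and report the outcome afterwards, so proxies that keep failing drop out of the rotation until they succeed again.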

Quick Start

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Crawl a single account
python main.py crawl virat.kohli --limit 5 --delay 5

# Batch crawl from jobs file
python main.py batch --jobs jobs.yaml --workers 3

# Check data status
python main.py status

# Export latest data for a target
python main.py export virat.kohli

# Check proxy health
python main.py proxies --proxies proxies.txt

Commands

Command   Description
crawl     Crawl a single Instagram account
batch     Batch crawl multiple accounts from a job file
status    Show status of all crawled data
export    Print latest JSON data for a target
proxies   Check proxy list health and availability

Proxy Setup

Add your proxies to proxies.txt (one per line):

http://proxy1.example.com:8080
socks5://user:password@proxy2.example.com:1080
https://proxy3.example.com:3128

If no proxy file is provided, the system runs in direct mode (your own IP).
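
As a rough illustration of how a one-per-line proxy file could be read, the sketch below keeps only entries with a scheme, host, and port. The function name `parse_proxy_lines`, the allowed scheme set (taken from the three examples above), and the skipping of `#` comment lines are assumptions, not documented behaviour of this tool.

```python
from urllib.parse import urlsplit

# Schemes that appear in the example proxies.txt above.
ALLOWED_SCHEMES = {"http", "https", "socks5"}


def parse_proxy_lines(lines):
    """Return the proxy URLs that look well formed, in file order."""
    proxies = []
    for raw in lines:
        line = raw.strip()
        # Skipping blank lines and '#' comments is an assumption.
        if not line or line.startswith("#"):
            continue
        try:
            parts = urlsplit(line)
            port = parts.port  # raises ValueError on a malformed port
        except ValueError:
            continue
        if parts.scheme in ALLOWED_SCHEMES and parts.hostname and port:
            proxies.append(line)
    return proxies
```

An empty result would correspond to the direct mode described above.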

Job File Format (jobs.yaml)

targets:
  - username: virat.kohli
    data_types: [profile, posts, reels]
    posts_limit: 20

  - username: cristiano
    data_types: [profile, posts]
    posts_limit: 50
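
The features list says job files may also be JSON. The sketch below validates a JSON job file with the same shape as the YAML above, using only the standard library. It is a stand-in for the project's Pydantic models, which are not shown here: the `Target` class, the `load_targets` function, the default limit of 10, and the acceptance of `comments` as a data type are all assumptions.

```python
import json
from dataclasses import dataclass

# "comments" is assumed from the project description; the example above
# only shows profile, posts, and reels.
KNOWN_TYPES = {"profile", "posts", "reels", "comments"}


@dataclass(frozen=True)
class Target:
    username: str
    data_types: tuple
    posts_limit: int


def load_targets(text, default_limit=10):
    """Parse a JSON job document and return validated Target records."""
    doc = json.loads(text)
    targets = []
    for entry in doc["targets"]:
        types = tuple(entry.get("data_types", ["profile"]))
        unknown = set(types) - KNOWN_TYPES
        if unknown:
            raise ValueError(f"unknown data types: {sorted(unknown)}")
        limit = int(entry.get("posts_limit", default_limit))
        if limit < 0:
            raise ValueError("posts_limit must not be negative")
        targets.append(Target(entry["username"], types, limit))
    return targets
```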

Output Structure

output/
├── index.json                 # Master index of all crawls
├── virat.kohli/
│   ├── profile_20260219.json
│   ├── posts_20260219.json
│   ├── reels_20260219.json
│   └── latest.json            # Full combined result
└── cristiano/
    └── ...
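
The naming scheme in the tree above can be reproduced with a short helper, shown here together with one simple way to deduplicate records. Both functions are illustrative sketches: `snapshot_path` and `merge_unique` are hypothetical names, and deduplicating on an `id` field is an assumption about the record shape.

```python
from datetime import date
from pathlib import Path


def snapshot_path(root, username, kind, day):
    """Build e.g. output/virat.kohli/posts_20260219.json."""
    return Path(root) / username / f"{kind}_{day:%Y%m%d}.json"


def merge_unique(existing, incoming, key="id"):
    """Append incoming records whose key has not been seen yet."""
    seen = {item[key] for item in existing}
    merged = list(existing)
    for item in incoming:
        if item[key] not in seen:
            seen.add(item[key])
            merged.append(item)
    return merged
```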

Configuration

All settings are configurable via environment variables (prefix AGENTDASH_):

Variable                        Default      Description
AGENTDASH_MAX_WORKERS           3            Concurrent crawler workers
AGENTDASH_DELAY_MIN             3.0          Min delay between requests (s)
AGENTDASH_DELAY_MAX             8.0          Max delay between requests (s)
AGENTDASH_PROXY_FILE            proxies.txt  Proxy list file
AGENTDASH_MAX_RETRIES           3            Retries per failed request
AGENTDASH_SESSION_ROTATE_AFTER  15           Rotate session after N requests
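
A minimal sketch of how these variables might be read, using the defaults from the table. The `Settings` class, `load_settings`, and `pick_delay` are hypothetical names and do not necessarily match the project's own configuration code.

```python
import os
import random
from dataclasses import dataclass

PREFIX = "AGENTDASH_"


@dataclass(frozen=True)
class Settings:
    max_workers: int = 3
    delay_min: float = 3.0
    delay_max: float = 8.0
    proxy_file: str = "proxies.txt"
    max_retries: int = 3
    session_rotate_after: int = 15


# How each raw environment string is converted.
_CASTS = {
    "max_workers": int,
    "delay_min": float,
    "delay_max": float,
    "proxy_file": str,
    "max_retries": int,
    "session_rotate_after": int,
}


def load_settings(env=None):
    """Read AGENTDASH_* variables, falling back to the defaults above."""
    env = os.environ if env is None else env
    values = {}
    for name, cast in _CASTS.items():
        raw = env.get(PREFIX + name.upper())
        if raw is not None:
            values[name] = cast(raw)
    return Settings(**values)


def pick_delay(settings, rng=random):
    """Choose a randomized delay in seconds between the two bounds."""
    return rng.uniform(settings.delay_min, settings.delay_max)
```

For example, running with `AGENTDASH_MAX_WORKERS=5` set in the environment would make `load_settings().max_workers` equal to 5 while leaving the other values at their defaults.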

⚠️ Disclaimer

This tool accesses publicly available data only. Use responsibly and respect Instagram's rate limits and Terms of Service.
