A production-grade Instagram data crawling platform that uses multiple concurrent crawlers with proxy-based IP rotation to collect profiles, posts, reels, and comments at scale — outputting validated, deduplicated JSON for data-sharing.
- Multi-Crawler Pool — configurable concurrent workers (1–10)
- Proxy IP Rotation — round-robin assignment, health tracking, auto-failover
- Anti-Detection — randomized delays, User-Agent rotation, session cycling
- Pydantic Validation — all data strictly typed and validated
- JSON Data Store — organized per-user directories with deduplication
- Batch Jobs — define targets in YAML/JSON, crawl them all in parallel
- Rich CLI — beautiful terminal output with tables and status panels
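
The round-robin assignment, health tracking, and auto-failover listed above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `ProxyPool` class, its method names, and the `max_failures` threshold are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProxyPool:
    """Hypothetical round-robin proxy pool with simple failure counting."""

    proxies: list[str]
    max_failures: int = 3  # assumed threshold, not taken from the project
    _failures: dict[str, int] = field(default_factory=dict)
    _cursor: int = 0

    def next_proxy(self) -> Optional[str]:
        """Return the next healthy proxy, or None if none is usable."""
        # Visit each proxy at most once per call, skipping unhealthy ones.
        for _ in range(len(self.proxies)):
            proxy = self.proxies[self._cursor % len(self.proxies)]
            self._cursor += 1
            if self._failures.get(proxy, 0) < self.max_failures:
                return proxy
        return None  # caller may fall back to direct mode

    def report_failure(self, proxy: str) -> None:
        """Count one failed request against a proxy."""
        self._failures[proxy] = self._failures.get(proxy, 0) + 1

    def report_success(self, proxy: str) -> None:
        """Reset the failure count after a successful request."""
        self._failures[proxy] = 0
```

A worker would call `next_proxy()` before each request and report the outcome afterwards, so a proxy that keeps failing is skipped on later rounds.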
# Create virtual environment
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Crawl a single account
python main.py crawl virat.kohli --limit 5 --delay 5
# Batch crawl from jobs file
python main.py batch --jobs jobs.yaml --workers 3
# Check data status
python main.py status
# Export latest data for a target
python main.py export virat.kohli
# Check proxy health
python main.py proxies --proxies proxies.txt

| Command | Description |
|---|---|
| crawl | Crawl a single Instagram account |
| batch | Batch crawl multiple accounts from a job file |
| status | Show status of all crawled data |
| export | Print latest JSON data for a target |
| proxies | Check proxy list health and availability |

Add your proxies to proxies.txt (one per line):
http://proxy1.example.com:8080
socks5://user:password@proxy2.example.com:1080
https://proxy3.example.com:3128
If no proxy file is provided, the system runs in direct mode (your own IP).
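
As a rough sketch of how a proxy file in this format might be read, the function below returns the valid entries and treats a missing file as direct mode. The name `load_proxies`, the `ALLOWED_SCHEMES` set, and the `#` comment handling are assumptions for illustration. The project's own loader may behave differently.

```python
from pathlib import Path
from typing import Optional
from urllib.parse import urlsplit

# Schemes that appear in the example proxy list; the real tool may accept others.
ALLOWED_SCHEMES = {"http", "https", "socks5"}


def load_proxies(path: Optional[str]) -> list[str]:
    """Return valid proxy URLs from a file; an empty list means direct mode."""
    if not path or not Path(path).is_file():
        return []
    proxies = []
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # comment support is an assumption
            continue
        parts = urlsplit(line)
        try:
            has_port = parts.port is not None  # raises ValueError on a bad port
        except ValueError:
            continue
        if parts.scheme in ALLOWED_SCHEMES and parts.hostname and has_port:
            proxies.append(line)
    return proxies
```

An empty result can then be used as the signal to send requests without a proxy.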
targets:
  - username: virat.kohli
    data_types: [profile, posts, reels]
    posts_limit: 20
  - username: cristiano
    data_types: [profile, posts]
    posts_limit: 50

output/
├── index.json # Master index of all crawls
├── virat.kohli/
│ ├── profile_20260219.json
│ ├── posts_20260219.json
│ ├── reels_20260219.json
│ └── latest.json # Full combined result
└── cristiano/
└── ...
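
To make the layout above concrete, here is a minimal sketch of how a dated posts file could be written with deduplication. The `save_posts` function and the `"id"` key used to detect duplicates are hypothetical. The project's actual storage code and post schema are not shown in this README.

```python
import json
from datetime import date
from pathlib import Path


def save_posts(output_dir: str, username: str, posts: list[dict], day: date) -> Path:
    """Write deduplicated posts to <output_dir>/<username>/posts_YYYYMMDD.json."""
    user_dir = Path(output_dir) / username
    user_dir.mkdir(parents=True, exist_ok=True)
    target = user_dir / f"posts_{day:%Y%m%d}.json"

    # Merge with anything already saved for this date.
    existing = json.loads(target.read_text(encoding="utf-8")) if target.exists() else []

    # Key on a hypothetical "id" field; a later record replaces an earlier one.
    merged = {post["id"]: post for post in existing + posts}

    target.write_text(json.dumps(list(merged.values()), indent=2), encoding="utf-8")
    return target
```

Calling it twice on the same day with overlapping posts leaves one record per `id`, with the most recent version kept.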
All settings are configurable via environment variables (prefix AGENTDASH_):
| Variable | Default | Description |
|---|---|---|
| AGENTDASH_MAX_WORKERS | 3 | Concurrent crawler workers |
| AGENTDASH_DELAY_MIN | 3.0 | Min delay between requests (s) |
| AGENTDASH_DELAY_MAX | 8.0 | Max delay between requests (s) |
| AGENTDASH_PROXY_FILE | proxies.txt | Proxy list file |
| AGENTDASH_MAX_RETRIES | 3 | Retries per failed request |
| AGENTDASH_SESSION_ROTATE_AFTER | 15 | Rotate session after N requests |

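
One way these variables could be read is sketched below, using only the standard library. The `Settings` dataclass, `load_settings`, and `next_delay` are hypothetical names. The project may well load its configuration through Pydantic instead. The defaults mirror the table above.

```python
import os
import random
from dataclasses import dataclass, fields
from typing import Mapping, Optional

PREFIX = "AGENTDASH_"


@dataclass(frozen=True)
class Settings:
    """Stand-in settings object; defaults copied from the table above."""

    max_workers: int = 3
    delay_min: float = 3.0
    delay_max: float = 8.0
    proxy_file: str = "proxies.txt"
    max_retries: int = 3
    session_rotate_after: int = 15


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from AGENTDASH_* variables, falling back to defaults."""
    env = os.environ if env is None else env
    overrides = {}
    for f in fields(Settings):
        raw = env.get(PREFIX + f.name.upper())
        if raw is not None:
            # Coerce the string to the type of the field's default value.
            overrides[f.name] = type(f.default)(raw)
    return Settings(**overrides)


def next_delay(settings: Settings) -> float:
    """Pick a random pause, in seconds, between the configured bounds."""
    return random.uniform(settings.delay_min, settings.delay_max)
```

For example, running with `AGENTDASH_MAX_WORKERS=5` in the environment would yield `Settings(max_workers=5, ...)` with every other field left at its default.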
This tool accesses publicly available data only. Use responsibly and respect Instagram's rate limits and Terms of Service.