@mreso mreso commented Jan 22, 2025

Start the lighthouse:

RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 1000

Start worker 0:

REPLICA_GROUP_ID=0 CUDA_VISIBLE_DEVICES=2,3 TORCHFT_MANAGER_PORT=29512 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --nnodes 1 --nproc-per-node 2 train_fsdp.py

Start worker 1:

REPLICA_GROUP_ID=1 CUDA_VISIBLE_DEVICES=6,7 TORCHFT_MANAGER_PORT=29513 TORCHFT_LIGHTHOUSE=http://localhost:29510/ torchrun --nnodes 1 --nproc-per-node 2 --rdzv-endpoint=localhost:29400 train_fsdp.py
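The two worker commands differ only in the replica group ID, the visible GPUs, the manager port, and (for worker 1) an explicit rendezvous endpoint. A minimal sketch that makes this explicit is below. `worker_cmd` is a hypothetical helper, not part of torchft. It only prints the command instead of running it, and it drops the trailing slash on worker 1's lighthouse URL on the assumption that it is not significant.

```shell
#!/bin/sh
# Hypothetical helper (not part of torchft): prints, but does not run,
# the launch command for one replica group using the values shown above.
worker_cmd() {
  replica_id="$1"     # REPLICA_GROUP_ID
  gpus="$2"           # CUDA_VISIBLE_DEVICES
  manager_port="$3"   # TORCHFT_MANAGER_PORT
  rdzv="${4:-}"       # optional --rdzv-endpoint value

  cmd="REPLICA_GROUP_ID=$replica_id CUDA_VISIBLE_DEVICES=$gpus"
  cmd="$cmd TORCHFT_MANAGER_PORT=$manager_port"
  cmd="$cmd TORCHFT_LIGHTHOUSE=http://localhost:29510"
  cmd="$cmd torchrun --nnodes 1 --nproc-per-node 2"
  if [ -n "$rdzv" ]; then
    cmd="$cmd --rdzv-endpoint=$rdzv"
  fi
  printf '%s\n' "$cmd train_fsdp.py"
}

worker_cmd 0 2,3 29512
worker_cmd 1 6,7 29513 localhost:29400
```

Each printed line can be pasted into its own terminal once the lighthouse is running.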

@facebook-github-bot added the CLA Signed label on Jan 22, 2025