Example train_ddp.py run command runs only one process, but there are two groups and it is not explained how to use both
(rualark, Apr 15, 2025)
After launching the lighthouse, you should launch two replica groups. Each torchrun launch instantiates a ManagerServer that communicates with the lighthouse and corresponds to a "replica group" in the design doc.
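Concretely, on a single machine that means three terminals. A minimal sketch, assuming the torchft_lighthouse flags shown in the README example and the REPLICA_GROUP_ID environment variable that train_ddp.py reads; the ports are arbitrary examples:

```sh
# Terminal 1: start the lighthouse, which coordinates quorum across replica
# groups (flags as in the README example).
RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000

# Terminal 2: replica group 0, with its own torchrun rendezvous port.
REPLICA_GROUP_ID=0 TORCHFT_LIGHTHOUSE=http://localhost:29510 \
    torchrun --master_port 29501 --nnodes 1 --nproc_per_node 1 train_ddp.py

# Terminal 3: replica group 1. Note the different --master_port: each replica
# group is an independent torchrun job with its own process group.
REPLICA_GROUP_ID=1 TORCHFT_LIGHTHOUSE=http://localhost:29510 \
    torchrun --master_port 29502 --nnodes 1 --nproc_per_node 1 train_ddp.py
```

With both groups up, killing either one should let the other keep training once the lighthouse re-establishes quorum, which is the fault-tolerance behavior the example is meant to demonstrate.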
See here: https://github.com/pytorch/torchft/tree/main?tab=readme-ov-file#example-training-loop-ddp
TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port 29501 --nnodes 1 --nproc_per_node 1 train_ddp.py
Two groups are specified by default: https://github.com/pytorch/torchft/blob/main/train_ddp.py#L35
It is not clear whether the same process group (master port) should be used in both replica groups, i.e. whether one DDP instance runs across multiple replica groups or not. A comment should probably be added here:
torchft/train_ddp.py, line 34 (commit dc1037e)
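For what it's worth, my reading, consistent with the explanation above, is: no, each replica group needs its own master port, since each is a separate torchrun job with its own process group; a single DDP instance never spans replica groups, and torchft synchronizes across groups through the lighthouse/Manager instead. Overriding the default of two groups would then look something like this, assuming NUM_REPLICA_GROUPS is the environment variable defaulted at the linked line:

```sh
# Hypothetical override: run as the third of three replica groups instead of
# the default two (NUM_REPLICA_GROUPS assumed from train_ddp.py line 34/35).
NUM_REPLICA_GROUPS=3 REPLICA_GROUP_ID=2 TORCHFT_LIGHTHOUSE=http://localhost:29510 \
    torchrun --master_port 29503 --nnodes 1 --nproc_per_node 1 train_ddp.py
```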