
Example train_ddp.py run command runs only one process, but there are two groups and it is not explained how to use both #162


Open
rualark opened this issue Apr 15, 2025 · 3 comments
Labels
documentation Improvements or additions to documentation

Comments


rualark commented Apr 15, 2025

See here: https://github.com/pytorch/torchft/tree/main?tab=readme-ov-file#example-training-loop-ddp

TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port 29501 --nnodes 1 --nproc_per_node 1 train_ddp.py

Two groups are specified by default: https://github.com/pytorch/torchft/blob/main/train_ddp.py#L35

It is not clear whether the same process group (master port) should be used in both replica groups, i.e. whether a single DDP instance spans multiple replica groups or each group runs its own. A comment here would probably help:

REPLICA_GROUP_ID = int(os.environ.get("REPLICA_GROUP_ID", 0))
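A minimal sketch of how the two env vars could drive per-group settings, assuming each replica group is launched as its own torchrun job (as in the reply below). The `master_port_for` helper is hypothetical, not part of train_ddp.py; it only illustrates why each replica group would need its own rendezvous port rather than sharing one `--master_port`.

```python
import os

# Read per-replica-group settings; REPLICA_GROUP_ID mirrors train_ddp.py,
# NUM_REPLICA_GROUPS mirrors the launch commands in the reply below.
REPLICA_GROUP_ID = int(os.environ.get("REPLICA_GROUP_ID", 0))
NUM_REPLICA_GROUPS = int(os.environ.get("NUM_REPLICA_GROUPS", 2))

# Hypothetical helper: give each replica group a distinct torchrun
# rendezvous port so the two jobs do not collide on the same --master_port.
def master_port_for(replica_group_id: int, base_port: int = 29501) -> int:
    return base_port + replica_group_id

print(REPLICA_GROUP_ID, NUM_REPLICA_GROUPS, master_port_for(REPLICA_GROUP_ID))
```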

@rualark rualark changed the title Example train_ddp.py run command runs only one process, but there are two groups and it is not explain how to use both Example train_ddp.py run command runs only one process, but there are two groups and it is not explained how to use both Apr 15, 2025
@WarrenZhu050413

Hi Rualark, this is what worked for me.

After launching the lighthouse, you should launch two groups. Each torchrun group launch instantiates a ManagerServer that can communicate with the lighthouse, and corresponds to a "replica group" in the design doc.

# Terminal 1 (replica group 0)
export REPLICA_GROUP_ID=0
export NUM_REPLICA_GROUPS=2
CUDA_VISIBLE_DEVICES=0 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --nnodes=1 --nproc_per_node=1 -- train_ddp.py

# Terminal 2 (replica group 1)
export REPLICA_GROUP_ID=1
export NUM_REPLICA_GROUPS=2
CUDA_VISIBLE_DEVICES=1 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --nnodes=1 --nproc_per_node=1 -- train_ddp.py
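The two launches above can also be driven from one script. This is a sketch only: it builds the command lines in a loop (one GPU and one REPLICA_GROUP_ID per group) and prints them as a dry run, since actually running them requires torchrun, a GPU per group, and a lighthouse already listening on port 29510.

```shell
#!/bin/sh
# Sketch: launch NUM_REPLICA_GROUPS replica groups, one GPU each.
# Dry run: the commands are echoed; uncomment eval/wait to really launch.
NUM_REPLICA_GROUPS=2
for id in 0 1; do
  cmd="CUDA_VISIBLE_DEVICES=$id REPLICA_GROUP_ID=$id NUM_REPLICA_GROUPS=$NUM_REPLICA_GROUPS TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --nnodes=1 --nproc_per_node=1 -- train_ddp.py"
  echo "$cmd"
  # eval "$cmd" &   # uncomment to launch this group in the background
done
# wait              # uncomment to block until both groups exit
```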


rualark commented Apr 23, 2025

Makes sense. Do you want to add this to README.md? Explaining what CUDA_VISIBLE_DEVICES does in the README would also help users.

@d4l3k d4l3k added the documentation Improvements or additions to documentation label Apr 25, 2025
@WarrenZhu050413

@d4l3k I would be happy to take a stab at this.
