
Example train_ddp.py run command runs only one process, but there are two groups and it is not explained how to use both #162


Open
rualark opened this issue Apr 15, 2025 · 3 comments
Labels
documentation Improvements or additions to documentation

Comments


rualark commented Apr 15, 2025

See here: https://github.com/pytorch/torchft/tree/main?tab=readme-ov-file#example-training-loop-ddp

TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port 29501 --nnodes 1 --nproc_per_node 1 train_ddp.py

Two groups are specified by default: https://github.com/pytorch/torchft/blob/main/train_ddp.py#L35

It is not clear whether the same process group (master port) should be used in both replica groups, i.e. whether a single DDP instance spans multiple replica groups or each group runs its own. A comment here would probably help:

REPLICA_GROUP_ID = int(os.environ.get("REPLICA_GROUP_ID", 0))
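A minimal sketch of how the two env vars could drive per-group settings, assuming each replica group is launched as its own torchrun job (as in the reply below). The `master_port_for` helper is hypothetical, not part of train_ddp.py; it only illustrates why each replica group would need its own rendezvous port rather than sharing one `--master_port`.

```python
import os

# Read per-replica-group settings; REPLICA_GROUP_ID mirrors train_ddp.py,
# NUM_REPLICA_GROUPS mirrors the launch commands in the reply below.
REPLICA_GROUP_ID = int(os.environ.get("REPLICA_GROUP_ID", 0))
NUM_REPLICA_GROUPS = int(os.environ.get("NUM_REPLICA_GROUPS", 2))

# Hypothetical helper: give each replica group a distinct torchrun
# rendezvous port so the two jobs do not collide on the same --master_port.
def master_port_for(replica_group_id: int, base_port: int = 29501) -> int:
    return base_port + replica_group_id

print(REPLICA_GROUP_ID, NUM_REPLICA_GROUPS, master_port_for(REPLICA_GROUP_ID))
```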

@rualark rualark changed the title Example train_ddp.py run command runs only one process, but there are two groups and it is not explain how to use both Example train_ddp.py run command runs only one process, but there are two groups and it is not explained how to use both Apr 15, 2025
@WarrenZhu050413

Hi Rualark, this is what worked for me.

After launching the lighthouse, you should launch two groups. Each torchrun group launch instantiates a ManagerServer that can communicate with the lighthouse, and corresponds to a "replica group" in the design doc.

# Terminal 1 (replica group 0)
export REPLICA_GROUP_ID=0
export NUM_REPLICA_GROUPS=2
CUDA_VISIBLE_DEVICES=0 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --nnodes=1 --nproc_per_node=1 -- train_ddp.py

# Terminal 2 (replica group 1)
export REPLICA_GROUP_ID=1
export NUM_REPLICA_GROUPS=2
CUDA_VISIBLE_DEVICES=1 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --nnodes=1 --nproc_per_node=1 -- train_ddp.py
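The two launches above can also be driven from one script. This is a sketch only: it builds the command lines in a loop (one GPU and one REPLICA_GROUP_ID per group) and prints them as a dry run, since actually running them requires torchrun, a GPU per group, and a lighthouse already listening on port 29510.

```shell
#!/bin/sh
# Sketch: launch NUM_REPLICA_GROUPS replica groups, one GPU each.
# Dry run: the commands are echoed; uncomment eval/wait to really launch.
NUM_REPLICA_GROUPS=2
for id in 0 1; do
  cmd="CUDA_VISIBLE_DEVICES=$id REPLICA_GROUP_ID=$id NUM_REPLICA_GROUPS=$NUM_REPLICA_GROUPS TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --nnodes=1 --nproc_per_node=1 -- train_ddp.py"
  echo "$cmd"
  # eval "$cmd" &   # uncomment to launch this group in the background
done
# wait              # uncomment to block until both groups exit
```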


rualark commented Apr 23, 2025

Makes sense. Do you want to add this to README.md? Explaining what CUDA_VISIBLE_DEVICES does in the README would also help users.

@d4l3k d4l3k added the documentation Improvements or additions to documentation label Apr 25, 2025
@WarrenZhu050413

@d4l3k I would be happy to take a stab at this.
