Hpc setup #1004

Open

wants to merge 2 commits into main

Conversation

githubsgi
Contributor

HPC-type setups (e.g. mpirun, mpiexec) currently require a code change in train.py (or elsewhere). Adding a couple of optional args (e.g. rank and local rank) to the toml file, along with a simple function in train.py, removes the need for those changes.

Optional additions to the toml files for running in environments that use non-torchrun launchers.
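A minimal sketch of the idea, assuming a helper in train.py and option names along the lines of hpc.rank_var / hpc.local_rank_var (the --hpc.local_rank_var flag is mentioned later in this thread; the actual names and code in this PR may differ):

    import os

    def apply_hpc_env(rank_var: str | None, local_rank_var: str | None) -> None:
        """Copy the launcher-set per-rank variables named in the toml onto RANK / LOCAL_RANK."""
        # rank_var / local_rank_var would come from the job config, e.g.
        # [hpc]
        # rank_var = "SLURM_PROCID"
        # local_rank_var = "SLURM_LOCALID"
        if rank_var and rank_var in os.environ:
            os.environ.setdefault("RANK", os.environ[rank_var])
        if local_rank_var and local_rank_var in os.environ:
            os.environ.setdefault("LOCAL_RANK", os.environ[local_rank_var])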
@facebook-github-bot added the CLA Signed label (This label is managed by the Meta Open Source bot.) on Mar 22, 2025
@JungHoyoun
Contributor

Aren’t those environment variables set automatically when using the torchrun command?

@githubsgi
Contributor Author

As I mentioned above, this is for non-torchrun launchers like mpirun/mpiexec. torchrun is specific to PyTorch, and HPC environments run many other applications that cannot use torchrun. HPC environments are optimized for whatever launcher they were designed around; in those environments, torchrun usually does not work or has sub-optimal performance.

@tianyu-l
Contributor

Sorry, I'm not familiar with the HPC setup you mentioned. I wonder whether, in that setup, you could set the environment variables outside train.py, like what we do in run_train.sh?

Even if you have to do it within the Python job, with the flexible job config support we recently added, you can customize it without changing torchtitan, e.g. by using the torchtitan Trainer as a submodule.
https://github.com/pytorch/torchtitan/blob/main/docs/extension.md#extending-jobconfig

@fegin
Contributor

If you are able to specify --hpc.local_rank_var, why don't you just prefix LOCAL_RANK=some_rank to your launch command?

@githubsgi
Contributor Author

@tianyu-l and @fegin, thanks for your replies. Let me address the questions one by one below.

  1. Setting the variables in run_train.sh: this does not work, for a subtle reason. Job launchers like torchrun/mpirun/mpiexec set the local and global rank variables uniquely in the shell/process of each rank. Setting the rank variables outside the launcher therefore makes them all the same, i.e. non-unique.
  2. Prefixing LOCAL_RANK=some_rank to the launch command: same issue as 1 above.
  3. Extending JobConfig: it is certainly possible to make this work, but it would be far more complex than what is needed, and it would require maintaining a separate piece of code elsewhere. The different launchers already set the rank variables uniquely in each rank's process/shell; all that is left to do is copy them into the variables that TorchTitan expects (see the sketch after this list).
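To make the last point concrete, here is a hedged sketch (not the code in this PR) of copying launcher-set variables onto the names torch expects; the variable names below are the standard ones for Open MPI, MPICH/Intel MPI, and SLURM, but your launcher may use different ones:

    import os

    def set_torch_env_from_launcher() -> None:
        """Copy per-rank variables set by mpirun/mpiexec/srun onto the names torch expects."""
        candidates = {
            "RANK": ("OMPI_COMM_WORLD_RANK", "PMI_RANK", "SLURM_PROCID"),
            "LOCAL_RANK": ("OMPI_COMM_WORLD_LOCAL_RANK", "MPI_LOCALRANKID", "SLURM_LOCALID"),
            "WORLD_SIZE": ("OMPI_COMM_WORLD_SIZE", "PMI_SIZE", "SLURM_NTASKS"),
        }
        for torch_var, launcher_vars in candidates.items():
            if torch_var in os.environ:
                continue  # already set (e.g. by torchrun); do not override
            for var in launcher_vars:
                if var in os.environ:
                    os.environ[torch_var] = os.environ[var]
                    break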

@TJ-Solergibert
Contributor

Hi @githubsgi, you might be interested in running something like this

@githubsgi
Contributor Author

githubsgi commented Mar 24, 2025

@TJ-Solergibert, it has the same issue as 1 above, if I understand the script correctly.

@TJ-Solergibert
Contributor

In torch distributed programs you have to set 5 environment variables:

  • MASTER_ADDR, MASTER_PORT & WORLD_SIZE, which need to be COMMON to all the processes. This is done here.
  • RANK & LOCAL_RANK, which are PARTICULAR to every process. Recall that we have one process per GPU. This is done here.

In this example you can consider srun as mpirun, and you set the number of processes through these 2 settings.
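For illustration, once those five variables are set in each process, torch.distributed can initialize from the environment without torchrun (a generic sketch, not torchtitan's actual init code):

    import os
    import torch
    import torch.distributed as dist

    # init_method="env://" reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE
    # from the environment the launcher (srun/mpirun) set up for this process.
    dist.init_process_group(backend="nccl", init_method="env://")

    # LOCAL_RANK selects the GPU for this process on its node.
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))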

@githubsgi
Contributor Author

githubsgi commented Mar 24, 2025

@TJ-Solergibert, thanks for pointing out the following line for the SLURM launcher srun:

srun $SRUN_ARGS bash -c "RANK=\$SLURM_PROCID LOCAL_RANK=\$SLURM_LOCALID $CMD" 2>&1 | tee -a $LOG_PATH

The above would also benefit from the simplification I am proposing; see below.

srun $SRUN_ARGS $CMD 2>&1 | tee -a $LOG_PATH

@fegin
Contributor

fegin commented Mar 25, 2025

@githubsgi You can check https://github.com/pytorch/torchtitan/blob/main/docs/extension.md#extending-jobconfig. This should meet your goal without adding new options to the main JobConfig.

@githubsgi
Contributor Author

@fegin, I have already addressed the complexity of that approach above. In fact, the approach suggested by @TJ-Solergibert above is simpler and more maintainable. The 2 environment variables I am proposing are applicable to any launcher, which makes the TorchTitan code very portable.

Labels
CLA Signed

6 participants