Hpc setup #1004

Conversation
Optional additions to the toml files for running in environments that use non-torchrun launchers.
Aren’t those environment variables set automatically when using torchrun?
As I mentioned above, this is for non-torchrun launchers like mpirun/mpiexec. torchrun is specific to torch; HPC environments run many other applications which cannot use torchrun, and they are optimized for whatever launcher they were designed around. In those environments, torchrun usually does not work or has sub-optimal performance.
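To make the point concrete, here is a minimal sketch (not code from this PR) of the kind of mapping such a job needs: copying the launcher-specific rank variables onto the names torch.distributed expects. The Open MPI, MPICH/PMI, and SLURM variable names below are the real ones those launchers set; the fallback logic itself is illustrative.

```python
import os

def env_from_launcher() -> None:
    """Populate RANK/LOCAL_RANK/WORLD_SIZE from whichever launcher is active."""
    candidates = {
        "RANK": ["OMPI_COMM_WORLD_RANK", "PMI_RANK", "SLURM_PROCID"],
        "LOCAL_RANK": ["OMPI_COMM_WORLD_LOCAL_RANK", "SLURM_LOCALID"],
        "WORLD_SIZE": ["OMPI_COMM_WORLD_SIZE", "PMI_SIZE", "SLURM_NTASKS"],
    }
    for target, sources in candidates.items():
        if target in os.environ:
            continue  # already set, e.g. by torchrun
        for src in sources:
            if src in os.environ:
                os.environ[target] = os.environ[src]
                break
```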
Sorry, I'm not familiar with the HPC setup you mentioned. I wonder whether, over there, you could set the environment variables outside train.py, like what we do in run_train.sh?
Even if you have to do it within the Python job, with the flexible job config support we recently added, you can customize it without changing torchtitan, e.g. by using the torchtitan Trainer as a submodule.
https://github.com/pytorch/torchtitan/blob/main/docs/extension.md#extending-jobconfig
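For reference, a rough sketch of what such a custom entry point could look like; the `torchtitan.train.Trainer` import path and call pattern here are assumptions for illustration, not the documented extension API.

```python
import os

def main() -> None:
    # Translate the launcher's variables before torchtitan initializes
    # distributed state (Open MPI names shown; adjust per launcher).
    os.environ.setdefault("RANK", os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
    os.environ.setdefault(
        "LOCAL_RANK", os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0")
    )

    from torchtitan.train import Trainer  # assumed import path

    ...  # construct and run Trainer as torchtitan's own train.py does

if __name__ == "__main__":
    main()
```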
If you are able to specify --hpc.local_rank_var, why don't you just prefix LOCAL_RANK=some_rank to your launch command?
@tianyu-l and @fegin, thanks for your replies. Let me try to address the questions one by one below.
Hi @githubsgi, you might be interested in running something like this:
@TJ-Solergibert, it has the same issue as point 1 above, if I understand the script.
In torch distributed programs you have to set 5 environment variables: MASTER_ADDR, MASTER_PORT, RANK, LOCAL_RANK, and WORLD_SIZE.
In this example you can consider …
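For concreteness, a minimal single-process sketch that sets those five variables and initializes against them (gloo backend, so it runs on CPU); with them in the environment, the default env:// rendezvous picks them up automatically:

```python
import os
import torch.distributed as dist

# The five variables torch.distributed conventionally relies on.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("LOCAL_RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

dist.init_process_group(backend="gloo")  # reads the env:// variables above
print(f"rank {dist.get_rank()} of {dist.get_world_size()}")
dist.destroy_process_group()
```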
@TJ-Solergibert, thanks for pointing out that line for the SLURM launcher srun. It would also benefit from the simplification I am proposing; see below.
@githubsgi You can check https://github.com/pytorch/torchtitan/blob/main/docs/extension.md#extending-jobconfig. This should meet your goal instead of adding new options to the main JobConfig.
@fegin, I have already addressed the complexity of that approach above. In fact, the approach suggested by @TJ-Solergibert is simpler and more maintainable. The 2 environment variables I am proposing are applicable to any launcher and thus make the TorchTitan code very portable.
An HPC-type setup (e.g. mpirun, mpiexec) currently requires code changes in train.py (or elsewhere). Adding a couple of optional args (e.g. rank and local rank) to the toml file, along with a simple function in train.py, removes the need for those changes.
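A sketch of the kind of helper the description suggests is below; the field names rank_var and local_rank_var echo the --hpc.local_rank_var flag discussed in the thread, but they are hypothetical here and the actual PR code may differ.

```python
import os

def set_ranks_from_launcher(rank_var: str | None, local_rank_var: str | None) -> None:
    """Copy launcher-specific rank variables onto the names torch expects.

    `rank_var` / `local_rank_var` would come from optional toml entries
    such as hpc.rank_var and hpc.local_rank_var (hypothetical names).
    """
    if rank_var and rank_var in os.environ:
        os.environ["RANK"] = os.environ[rank_var]
    if local_rank_var and local_rank_var in os.environ:
        os.environ["LOCAL_RANK"] = os.environ[local_rank_var]

# e.g. under Open MPI:
# set_ranks_from_launcher("OMPI_COMM_WORLD_RANK", "OMPI_COMM_WORLD_LOCAL_RANK")
```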