❓ Questions and Help
Hi,
As noted here and here, it looks like pytorch/xla's multiprocessing causes host RAM usage to scale with the number of cores being used. Increasing the RAM is fine when n_cores=8, but if you're running on a TPU Pod slice with many more cores, simply adding RAM won't work.
What's the recommended way to scale large models that take up ~10 GB of RAM per core to a TPU Pod with 32 or 64 cores? Would multithreading be the solution? Is there a performance difference between start_method='fork' and start_method='spawn'?
Thanks,
Bilal
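For context, a minimal sketch of the usual xmp.spawn pattern that produces this behavior (the model and tensor sizes here are placeholders, not from the original report): every child process builds its own host-side copy of the model before moving it to its TPU core, so host RAM grows roughly linearly with nprocs. The start_method argument is forwarded to torch.multiprocessing.

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp


def make_model():
    # Placeholder model; with a real ~10 GB model the per-process cost is obvious.
    return nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))


def _mp_fn(index):
    device = xm.xla_device()
    # Each of the nprocs child processes runs this function and constructs its
    # own host-side copy of the model, which is why host RAM scales with cores.
    model = make_model().to(device)
    x = torch.randn(8, 1024, device=device)
    model(x)
    xm.mark_step()


if __name__ == '__main__':
    # start_method is passed through to torch.multiprocessing: 'spawn' re-imports
    # the module in each child, while 'fork' inherits the parent's memory pages.
    xmp.spawn(_mp_fn, args=(), nprocs=8, start_method='spawn')
```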
At the moment, PyTorch/XLA on a TPU pod requires a matching number of user VMs, one per TPU VM.
So each user VM drives 8 TPU v3 cores in any case.
We are migrating the architecture so that the user and TPU VMs will be consolidated, with a considerably higher number of cores and more RAM.
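Independent of that consolidation work, one commonly referenced way to keep host RAM roughly flat across the 8 local processes on each VM is to build the model once in the parent and share it, e.g. via xmp.MpModelWrapper together with start_method='fork'. This is only a sketch under the assumption that the installed torch_xla version provides MpModelWrapper; it is not necessarily the approach recommended above, and the model is again a placeholder.

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp


def make_model():
    return nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))


# Build the model once in the parent process; MpModelWrapper keeps a single
# shared host copy and hands out per-device copies inside the children.
WRAPPED_MODEL = xmp.MpModelWrapper(make_model())


def _mp_fn(index):
    device = xm.xla_device()
    # Moves the shared host weights to this process's TPU core.
    model = WRAPPED_MODEL.to(device)
    x = torch.randn(8, 1024, device=device)
    model(x)
    xm.mark_step()


if __name__ == '__main__':
    # 'fork' lets the children share the parent's memory pages, so the 8 local
    # processes do not each pay the full host-RAM cost of the model.
    xmp.spawn(_mp_fn, args=(), nprocs=8, start_method='fork')
```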