PyTorch not able to access all cores #1576
Comments
Can you try running https://github.com/pytorch/xla/blob/master/test/test_train_mp_mnist.py? I don't think your setup is wrong, just that the way you're using the multiprocessing API may not be correct. Make sure to specify the correct number of processes to spawn: for a v3-8 it should be 8.
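For context, here is a minimal sketch of what spawning on all 8 cores of a v3-8 looks like; the function name and body are placeholders, not taken from the test script:

    import torch_xla.core.xla_model as xm
    import torch_xla.distributed.xla_multiprocessing as xmp

    def _mp_fn(index):
        # each spawned process gets exactly one TPU core as its device
        device = xm.xla_device()
        # ... build the model and data loaders, then train on `device` ...

    if __name__ == '__main__':
        xmp.spawn(_mp_fn, args=(), nprocs=8)  # 8 processes for a v3-8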
@jysohn23 Interestingly, this code is working. However, if I type xm.xrt_world_size(), it returns 1.
Yes, that's expected, since your world is only a single process in that case.
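To illustrate (a sketch assuming the standard torch_xla imports; the printed values are the ones reported in this thread, for a v3-8):

    import torch_xla.core.xla_model as xm
    import torch_xla.distributed.xla_multiprocessing as xmp

    def _mp_fn(index):
        # inside the spawned workers, all 8 processes form the world
        print(xm.xrt_world_size())  # prints 8

    if __name__ == '__main__':
        # in the un-spawned parent process the world is just this one process
        print(xm.xrt_world_size())  # prints 1
        xmp.spawn(_mp_fn, args=(), nprocs=8)

The exact value reported outside of spawn may differ across torch_xla versions (the original post below mentions zero), but the point is the same: the full world only exists inside the spawned processes.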
Ah ok I didn't know that. Here is the error I am getting with my code:
Any idea what it could be due to?
That means you are trying to replicate when the system sees 8 local devices but you are using only 1. See xla/test/test_train_mp_mnist.py, line 181 at commit 2463ede.
Without …
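If I'm reading the torch_xla multiprocessing API of that era correctly, nprocs in xmp.spawn is expected to be either 1 or the full number of local devices, so a mismatch like the one above produces exactly this kind of replication error. A sketch of the two accepted choices on a v3-8 (the function name is a placeholder):

    import torch_xla.distributed.xla_multiprocessing as xmp

    def _mp_fn(index):
        ...  # per-process training / evaluation code

    # replicate across every local core:
    xmp.spawn(_mp_fn, args=(), nprocs=8)
    # or run a single process on a single device:
    # xmp.spawn(_mp_fn, args=(), nprocs=1)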
Hi,
I did not use xmp.spawn to do prediction at step 2 because I did not find a way to return the prediction results from xmp.spawn (saving the results to disk is not an option for me).
Let me know if you want a new issue opened. Thanks.
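If returning results to the parent process is the sticking point, one hedged workaround (not from the torch_xla docs; the names here are made up, and it assumes the spawned function receives the extra args after the process index) is to pass a multiprocessing.Manager list into xmp.spawn and have each worker append its CPU-side predictions to it:

    import multiprocessing

    import torch_xla.core.xla_model as xm
    import torch_xla.distributed.xla_multiprocessing as xmp

    def _predict_fn(index, shared_results):
        device = xm.xla_device()
        # ... run the trained model on this core's shard of the data ...
        preds = ...  # move tensors to CPU (e.g. preds.cpu()) before sharing
        shared_results.append((index, preds))

    if __name__ == '__main__':
        manager = multiprocessing.Manager()
        results = manager.list()
        xmp.spawn(_predict_fn, args=(results,), nprocs=8)
        all_preds = list(results)  # visible in the parent once spawn returns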
@world2vec I think this is similar to #2268, and the answer is that it's not possible to use all cores after you have used only one core.
@tmabraham
@world2vec you should be able to predict inside the spawned processes; why do you need to predict outside?

    def train(...):
        ...

    def predict(...):
        ...

    if __name__ == '__main__':
        xmp.spawn(train, args=(), nprocs=8)
        xmp.spawn(predict, args=(), nprocs=1)
@taylanbil
By the way, I prefer this way, getting the output from the spawn: I have another function, post_process, that processes the outputs from model1 and model2. As the function names say, predict does prediction, post_process does post-processing, and train does training. Technically we could put all the code in one function, as in a toy notebook, but that is not good.
Actually, I tried to get this to work with a simple example and I'm running into problems, so I don't suggest doing this. It is all doable with one spawn; is there a reason why that's not OK for you?

    def main():  # this is run on all 8 cores
        # train model 1 on sharded data
        # predict using model 1 on unsharded data (unnecessary, duplicate computation, but not a big deal)
        # train model 2 on sharded data
        ...

    xmp.spawn(main, args=(), nprocs=8)
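For completeness, here is a slightly fleshed-out version of that single-spawn structure; it is still only a sketch, and the helpers train_one_model, run_inference, and post_process are made-up names standing in for the user's own code:

    import torch_xla.core.xla_model as xm
    import torch_xla.distributed.xla_multiprocessing as xmp

    def main(index):
        device = xm.xla_device()

        # train model 1 on this process's shard of the data
        model1 = train_one_model(device, shard=index)

        # predict with model 1 on the full, unsharded data; every process
        # repeats this work, which is wasteful but not a big deal
        preds1 = run_inference(model1, device)

        # train model 2 on sharded data (possibly using preds1 as inputs)
        model2 = train_one_model(device, shard=index, extra=preds1)

        # let a single process do the post-processing / saving
        if xm.is_master_ordinal():
            post_process(preds1, model2)

    if __name__ == '__main__':
        xmp.spawn(main, args=(), nprocs=8)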
Well, that means every time I need to wrap everything into one function to use xmp.spawn, but technically, yes, we can do it that way.
Hello all,
I set up a GCP instance of a TPU v3-8 and a VM instance in the same region, with the torch-nightly build. When I set up PyTorch XLA, I noticed that xm.xrt_world_size() returns zero but xm.get_xla_supported_devices() returns all 8 devices. When I try to run code for training on TPUs, it won't spawn 8 processes, saying that nprocs=8 is not allowed as it is more than xm.xrt_world_size(). Was the GCP instance set up wrong?