xla_device with args #2374
@ultrons FYI in case you were curious about this use case
What does "independent" mean?
there are no collective ops... each one runs on its own process with its own optimizer and dataset
Then yes, no problem. As long as you do not call xm.optimizer_step().
There is still no need to pass an ordinal to xm.xla_device().
ok. i think we still call the xm.optimizer_step()
Wait ... you do need to call
@davidel I'm a little confused. How would a process select a different core without providing an ordinal? Will it automatically pick the next free core? Also, could you explain why
The pytorch/xla multiprocessing automatically partitions the devices and assigns a proper "current device" to each process. If you call xm.xla_device() with no ordinal, you simply get the device that was assigned to the calling process.
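For illustration, a minimal sketch of that flow (the function name _mp_fn and the process count are conventions borrowed from the torch_xla examples, not part of this thread):

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # No ordinal is passed here: xmp.spawn() has already assigned a
    # "current device" to this process, and xm.xla_device() returns it.
    device = xm.xla_device()
    print('process {} got device {}'.format(index, device))

if __name__ == '__main__':
    xmp.spawn(_mp_fn, args=(), nprocs=8)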
I was just checking out the issues, and found this one. I don't know if you guys know, but @abhishekkrthakur invented this exact technique of training multiple folds on TPUs over here. Here's a YouTube video on the same.
I meant when we run each process on a separate core manually, instead of using multiprocessing, we would have to choose a device using xm.xla_device(n). In the API Guide, it's mentioned to call
@tmabraham This functionality was added based on his kernel :]
@lezwon So the implementation is in the kernel. So I am curious: what is the confusion?
@tmabraham abhishek's kernel basically demonstrates training K models in parallel. The functionality implemented in PyTorch Lightning supports both: training with multi-processing as well as training on a single core. Given the code differences between them, there are some issues we are trying to resolve around training and checkpointing, to ensure a consistent experience for the user, similar to training on GPUs. You can view this PR for more info.
That code "happens to work" 😄 device = xm.xla_device(fold + 1) It assumes the positions of TPU devices starting from 1. Also, that code uses multi-threading, which is considerably slower 20..30% to multi-processing due to GIL serialization over the model's python code. |
Ok, to summarize so far: Davide mentioned the points above. @williamFalcon @lezwon @Borda, let us know if you're still running into issues.
Yes. You just cannot select a device. It gets assigned to you. If you need to know an ordinal, in order to create data samplers, use xm.get_ordinal().
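A short sketch of that sampler setup (the dataset here is synthetic stand-in data; the sampler arguments follow the standard torch.utils.data.distributed.DistributedSampler API):

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()  # the device assigned to this process
dataset = torch.utils.data.TensorDataset(
    torch.randn(1024, 10), torch.randint(0, 2, (1024,)))
sampler = torch.utils.data.distributed.DistributedSampler(
    dataset,
    num_replicas=xm.xrt_world_size(),  # total number of cores/processes
    rank=xm.get_ordinal(),             # this process's ordinal
    shuffle=True)
loader = torch.utils.data.DataLoader(dataset, batch_size=64, sampler=sampler)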
When I look at the documentation which mentions training on single-core, it says
Or is
@davidel The code I wrote uses multi-threading because multi-processing needed 8 times the memory. It was not able to fit a multi-processing model in Kaggle kernels, which have 16 GB of RAM. So, the conclusion is, we call
In cases where you do not call xm.optimizer_step() (like somewhere mentioned in this thread), AND you are not using the ParallelLoader, you need to call xm.mark_step(). Otherwise the ParallelLoader calls it itself:
https://github.com/pytorch/xla/blob/c4f8873d791e36e9819c102bac0e309d88b6ca8b/torch_xla/distributed/parallel_loader.py#L37
WRT OOM on 16GB Kaggle, did you try the MpModelWrapper:
https://github.com/pytorch/xla/blob/c4f8873d791e36e9819c102bac0e309d88b6ca8b/torch_xla/distributed/xla_multiprocessing.py#L398
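For reference, a minimal sketch of the MpModelWrapper pattern (the model is a toy stand-in; the assumption here is that it is combined with start_method='fork' so the parent's host copy of the weights is shared rather than duplicated per process):

import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

# Build the model once in the parent process and wrap it, so the forked
# child processes reuse the parent's host memory instead of each keeping
# their own copy.
WRAPPED_MODEL = xmp.MpModelWrapper(torch.nn.Linear(10, 2))  # placeholder model

def _mp_fn(index):
    device = xm.xla_device()
    model = WRAPPED_MODEL.to(device)  # materializes the weights on this core
    print('process {} has model on {}'.format(index, device))

if __name__ == '__main__':
    xmp.spawn(_mp_fn, args=(), nprocs=8, start_method='fork')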
I can't use MpModelWrapper when I'm training a model on each core. Can I?
You could try the serial executor:
Where the function you pass to it is like:

import gc

def _make_device_model(device):
    model = MyModel(...)
    return model.to(device)

def _serial_model_create(device):
    model = _make_device_model(device)
    gc.collect()
    return model
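The link to the serial executor appears to have been lost above; assuming it refers to xmp.MpSerialExecutor, a usage sketch building on the function shown would look roughly like this:

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

SERIAL_EXEC = xmp.MpSerialExecutor()  # created once, shared by all processes

def _mp_fn(index):
    device = xm.xla_device()
    # Only one process at a time runs the creation function, which keeps
    # peak host-memory usage down on small (e.g. 16GB) machines.
    model = SERIAL_EXEC.run(lambda: _serial_model_create(device))
    # ... training for this core continues with `model` ...

if __name__ == '__main__':
    xmp.spawn(_mp_fn, args=(), nprocs=8)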
@davidel If we do not call xm.optimizer_step(), and instead do an optimizer.step() followed by mark_step() in the training loop, this would mean that no all-reduce happens across the cores. Since each core is already working on a separate shard of the dataset, essentially we will be training 8 independent models in parallel, and at the end of the training loop we can write out those models. Is that the right understanding? If so, then @abhishekkrthakur can probably consider going that route. @abhishekkrthakur is that what you want to accomplish as k-fold training?
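To make the question concrete, a sketch of the flow being described might look like the following (synthetic data and a toy model stand in for the real per-core fold; whether this is sufficient is addressed in the reply below):

import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    device = xm.xla_device()
    model = torch.nn.Linear(10, 2).to(device)   # toy stand-in for the real model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for step in range(100):
        # Each core would normally read its own fold/shard here.
        data = torch.randn(8, 10, device=device)
        target = torch.randint(0, 2, (8,), device=device)
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(data), target)
        loss.backward()
        optimizer.step()   # plain step: no cross-core gradient all-reduce
        xm.mark_step()     # needed since xm.optimizer_step() is not called
    # Write out this core's independent model.
    xm.save(model.state_dict(), 'model_{}.pt'.format(xm.get_ordinal()),
            master_only=False)

if __name__ == '__main__':
    xmp.spawn(_mp_fn, args=(), nprocs=8)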
Actually, it is a bit more complex.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
PyTorch Lightning had some interest in a use case that we haven't explored much: running separate and independent processes on each TPU core, each with its own slice of the dataset. I think the use case was k-fold training and/or hyperparameter tuning, but I'd like @williamFalcon, @lezwon, and @Borda to correct me if I'm wrong.
@ailzhang and @JackCaoG and @davidel are not convinced that this will work with the current system.
Right now, the API gives the option of choosing a device: https://github.com/pytorch/xla/blob/master/torch_xla/core/xla_model.py#L221
This makes it seem like the use case is possible, and PyTorch Lightning was able to get it working on Colab, but is having some issues when running in GKE.
Should we remove this option from the API if we don't intend for anyone to use it?
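For reference, the two calling conventions in question are roughly the following (a sketch; the explicit ordinal here is arbitrary):

import torch_xla.core.xla_model as xm

# Multiprocessing path: take the device assigned to this process.
device = xm.xla_device()

# Per-core, independent-process path: pick a specific core by ordinal,
# which is the option this issue is asking about.
device3 = xm.xla_device(3)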