xla_device with args #2374

Closed
zcain117 opened this issue Jul 24, 2020 · 24 comments
Labels
stale Has not had recent activity

Comments

@zcain117
Collaborator

PyTorch Lightning had some interest in a use case that we haven't explored much: running separate and independent processes on each TPU core, each with its own slice of the dataset. I think the use case was k-fold training and/or hyperparam tuning but I wanted @williamFalcon and @lezwon and @Borda to correct me if I'm wrong.

@ailzhang and @JackCaoG and @davidel are not convinced that this will work with the current system.

Right now, the API gives the option of choosing a device: https://github.com/pytorch/xla/blob/master/torch_xla/core/xla_model.py#L221

This makes it seem like the use case is supported. PyTorch Lightning was able to get it working on Colab but is running into issues when running in GKE.

Should we remove this option from the API if we don't intend for anyone to use it?

@zcain117
Collaborator Author

@ultrons FYI in case you were curious about this use case

@davidel
Collaborator

davidel commented Jul 25, 2020

What does "independent" mean?
Do they do any collective ops? If they do, they are dependent.
I am not sure what different slices of the dataset means here; distributed training already does that. It might be worth a better explanation.

@williamFalcon

there are no collective ops... each one runs on its own process with its own optimizer and dataset

@davidel
Collaborator

davidel commented Jul 25, 2020

Then yes, no problem. As long as you do not call xm.optimizer_step().

@davidel
Collaborator

davidel commented Jul 25, 2020

There is still no need to pass an ordinal to xm.xla_device().

@williamFalcon

ok. i think we still call the xm.optimizer_step()
i guess we can use the regular optimizer when operating in this mode @lezwon

@davidel
Collaborator

davidel commented Jul 25, 2020

Wait ... you do need to call xm.mark_step() though (instead of xm.optimizer_step()).
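For illustration, a minimal sketch of the difference (independent_train_step is a hypothetical helper; model, optimizer, loss_fn, data and target are assumed to already live on the XLA device):

import torch_xla.core.xla_model as xm

def independent_train_step(model, optimizer, loss_fn, data, target):
  # Independent per-core training: a plain optimizer.step() plus xm.mark_step(),
  # instead of xm.optimizer_step(optimizer), so no cross-core gradient reduction happens.
  optimizer.zero_grad()
  loss = loss_fn(model(data), target)
  loss.backward()
  optimizer.step()   # no all-reduce of gradients
  xm.mark_step()     # cut and execute the pending XLA graph for this step
  return loss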

@lezwon
Contributor

lezwon commented Jul 25, 2020

There is still no need to pass an ordinal to xm.xla_device().

@davidel I'm a little confused. How would a process select a different core without providing an ordinal? Will it automatically pick the next free core? Also, could you explain why xm.optimizer_step() should not be called?

@davidel
Collaborator

davidel commented Jul 25, 2020

The pytorch/xla multiprocessing automatically partitions the devices and assigns a proper "current device" to each process.

If you call xm.optimizer_step() the different cores will try to reduce the gradients, hence the cores are not really independent.
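As a sketch of what that automatic assignment looks like (assuming an 8-core TPU; _mp_fn is a hypothetical name):

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
  # No ordinal is passed: each spawned process is handed its own "current device".
  device = xm.xla_device()
  print('process', index, 'ordinal', xm.get_ordinal(), 'device', device)

if __name__ == '__main__':
  xmp.spawn(_mp_fn, args=(), nprocs=8)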

@tmabraham

I was just checking out the issues, and found this one.

I don't know if you guys know, but @abhishekkrthakur invented this exact technique of training multiple folds on TPUs over here. Here's a YouTube video on the same.

@lezwon
Contributor

lezwon commented Jul 26, 2020

The pytorch/xla multiprocessing automatically partitions the devices and assigns a proper "current device" to each process.

I meant that when we run each process on a separate core manually, instead of using multiprocessing, we would have to choose a device using xla_device, right?

If you call xm.optimizer_step() the different cores will try to reduce the gradients, hence the cores are not really independent.

In the API Guide, it's mentioned to call xm.optimizer_step(optimizer, barrier=True). Does this create a problem when running multiple processes in parallel? What if we provide a replica group using the groups parameter, i.e. xm.optimizer_step(optimizer, barrier=True, groups=[[xm.get_ordinal()]])?

@lezwon
Contributor

lezwon commented Jul 26, 2020

@tmabraham This functionality was added based on his kernel :]

@tmabraham

tmabraham commented Jul 26, 2020

@lezwon So the implementation is in that kernel. I am curious, then: what is the confusion?

@lezwon
Contributor

lezwon commented Jul 26, 2020

@tmabraham So abhishek's kernel basically demonstrates training K models in parallel. The functionality implemented in PyTorch Lightning supports both training with multi-processing and training on a single core. Given the code differences between the two, there are some issues we are trying to resolve around training and checkpointing, to give the user an experience consistent with training on GPUs. You can view this PR for more info.

@davidel
Collaborator

davidel commented Jul 26, 2020

I was just checking out the issues, and found this one.

I don't know if you guys know, but @abhishekkrthakur invented this exact technique of training multiple folds on TPUs over here. Here's a YouTube video on the same.

That code "happens to work" 😄
It does the same:

device = xm.xla_device(fold + 1)

It assumes the positions of the TPU devices start from 1.
It should have called xm.get_xla_supported_devices(NUM_CORES) and indexed the resulting list with fold.

Also, that code uses multi-threading, which is considerably (20-30%) slower than multi-processing due to GIL serialization over the model's Python code.
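In other words, a sketch of the safer lookup (fold is illustrative and 0-indexed; this applies to the multi-threaded setup where all cores are visible to one process):

import torch_xla.core.xla_model as xm

devices = xm.get_xla_supported_devices()  # all XLA devices visible to this process
fold = 0                                  # folds are 0-indexed
device = devices[fold]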

@zcain117
Collaborator Author

Ok to summarize so far:

  • specifying a TPU core via xm.xla_device(fold) will probably work but it's not a use case we test or promote
  • if specifying a TPU core:
    • remember to use 0-indexed arg, i.e. fold instead of fold+1.
    • maybe try device = xm.get_xla_supported_devices(NUM_CORES)[fold] instead of device = xm.xla_device(fold) if the latter isn't working.
    • use xm.mark_step() instead of xm.optimizer_step(), since the latter will consolidate gradients between cores whereas you want your cores to train independent models.

Davide mentioned "There is still no need to pass an ordinal to xm.xla_device()." I think his point was that each core is already independent (as long as you stop calling xm.optimizer_step). Instead of requesting a particular device, you could use our recommended flow and run the code without knowing before spawn time which device you'll end up on. In the code that runs on the TPU core, you can find which device you ended up on using code like this (example usage), and then maybe use something like torch.utils.data.Subset inside that core's code to make sure the core is using the right data. @davidel let me know if that seems right.
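For example, a rough sketch of that flow (names like _mp_fn and fold_indices are hypothetical; fold_indices would be a precomputed list of index lists, one per fold):

import torch.utils.data
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index, dataset, fold_indices):
  device = xm.xla_device()          # assigned automatically, no ordinal argument
  fold = xm.get_ordinal()           # which core/process this turned out to be
  subset = torch.utils.data.Subset(dataset, fold_indices[fold])
  loader = torch.utils.data.DataLoader(subset, batch_size=128, shuffle=True)
  # ... build the model/optimizer on `device` and train on `loader` independently,
  # using optimizer.step() + xm.mark_step() as discussed above ...

# xmp.spawn(_mp_fn, args=(dataset, fold_indices), nprocs=8)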

@williamFalcon @lezwon @Borda let us know if you're still running into issues.

@davidel
Collaborator

davidel commented Jul 30, 2020

Yes. You just cannot select a device. It gets assigned to you.
And xm.xla_device() will tell you what it is.
In case of thread-based parallelism (which I would not use, as it's deprecated and 20-30% slower), the device gets passed to the target function.

If you need to know the ordinal (in order to create data samplers, for example), use xm.get_ordinal() (and xm.xrt_world_size() for the world size).
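That is the usual pytorch/xla sampler pattern, roughly (train_dataset is assumed to exist; batch size is illustrative):

import torch.utils.data
from torch.utils.data.distributed import DistributedSampler
import torch_xla.core.xla_model as xm

sampler = DistributedSampler(
    train_dataset,
    num_replicas=xm.xrt_world_size(),  # total number of cores/processes
    rank=xm.get_ordinal(),             # this process' shard
    shuffle=True)
loader = torch.utils.data.DataLoader(train_dataset, batch_size=128, sampler=sampler)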

@abhishekkrthakur

When I look at the documentation section on single-core training, it says to use xm.optimizer_step with barrier=True, not xm.mark_step. Can it be updated to reflect what we have learned here? Or is the documentation deprecated?

import torch.nn as nn
import torch.optim as optim
import torch_xla.core.xla_model as xm

# MNIST, train_loader, lr and momentum are defined elsewhere in the docs example.
device = xm.xla_device()
model = MNIST().train().to(device)
loss_fn = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=lr, momentum=momentum)

for data, target in train_loader:
  optimizer.zero_grad()
  data = data.to(device)
  target = target.to(device)
  output = model(data)
  loss = loss_fn(output, target)
  loss.backward()

  xm.optimizer_step(optimizer, barrier=True)

Or is mark_step used only when we specify the device?

@davidel The code I wrote uses multi-threading because multi-processing needed 8 times the memory. I was not able to fit the model with multi-processing in Kaggle kernels, which have 16GB of RAM.

So the conclusion is: we call xm.xla_device() without an index, we use mark_step(), and that's it. Right?

@davidel
Collaborator

davidel commented Jul 30, 2020

In cases where you do not call xm.optimizer_step() (as mentioned earlier in this thread), AND you are not using the ParallelLoader, you need to call xm.mark_step() yourself.
Otherwise the ParallelLoader calls it for you.

WRT the OOM on the 16GB Kaggle kernel, did you try the MpModelWrapper:

class MpModelWrapper(object):
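For reference, a sketch of how the wrapper is typically used (not code from this thread; MNIST() stands in for whatever model class you construct):

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

# Build the model once in the parent process; the wrapper keeps a single host-memory
# copy that the forked child processes share until they move it to their device.
WRAPPED_MODEL = xmp.MpModelWrapper(MNIST())

def _mp_fn(index):
  device = xm.xla_device()
  model = WRAPPED_MODEL.to(device)  # sends the shared weights to this core's device
  # ... train ...

# xmp.spawn(_mp_fn, args=(), nprocs=8, start_method='fork')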

@abhishekkrthakur

abhishekkrthakur commented Jul 30, 2020 via email

@davidel
Collaborator

davidel commented Jul 30, 2020

You could try the serial executor:

class MpSerialExecutor(object):

Where the function you pass to it is like:

import gc

def _make_device_model(device):
  model = MyModel(...)  # whatever model class you use
  return model.to(device)

def _serial_model_create(device):
  model = _make_device_model(device)
  gc.collect()  # reclaim temporary host memory left over from model construction
  return model
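Presumably this is then used together with the serial executor like so (a sketch building on the functions above; _mp_fn is a hypothetical name):

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

SERIAL_EXEC = xmp.MpSerialExecutor()

def _mp_fn(index):
  device = xm.xla_device()
  # Only one process builds and moves its model at a time, keeping peak host RAM low.
  model = SERIAL_EXEC.run(lambda: _serial_model_create(device))
  # ... train ...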

@ultrons
Contributor

ultrons commented Jul 31, 2020

@davidel, if we do not call xm.optimizer_step and instead do an optimizer.step() followed by mark_step in the training loop, this would mean that no all-reduce happens across the cores. Since each core is already working on a separate shard of the dataset, essentially we would be training 8 independent models in parallel, and at the end of the training loop we can write out those models. Is that the right understanding? If so, then @abhishekkrthakur can probably consider going that route. @abhishekkrthakur, is that what you want to accomplish with k-fold training?

@davidel
Collaborator

davidel commented Jul 31, 2020

Actually, it is a bit more complex.
The TPUs (in the way we configure them in replication mode) have a global barrier that all cores have to reach before execution starts.
This means that the model described above only works if all the cores run the same number of TPU executions.
A totally independent training setup (where the number of TPU executions is uneven across cores) requires changes and the addition of a special mode.
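If the fold sizes are known up front, one hypothetical way to respect that constraint is to cap every core at the same number of steps per epoch (illustrative only; fold_indices, batch_size and loader are assumed from the earlier sketches):

# Cap every core at the same number of steps so all cores issue the same
# number of TPU executions per epoch and keep reaching the global barrier together.
steps_per_epoch = min(len(idx) for idx in fold_indices) // batch_size

for step, (data, target) in enumerate(loader):
  if step >= steps_per_epoch:
    break
  # ... independent train step (optimizer.step() + xm.mark_step()) ...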

lezwon mentioned this issue Aug 9, 2020
@stale

stale bot commented Aug 30, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale (Has not had recent activity) label Aug 30, 2020
stale bot closed this as completed Sep 6, 2020