
Metrics reduction on distributed TPU setting #965

Closed
vfdev-5 opened this issue Apr 22, 2020 · 2 comments · Fixed by #1045

Comments


vfdev-5 commented Apr 22, 2020

🚀 Feature

Ignite will support distributed training on TPUs (e.g. #960). Currently, metrics computation is impacted in the same way as for DDP on GPUs: without a reduction step, each process only computes its metric over its own portion of the data.

The idea is to improve metrics computation and reduce internal values across processes, as is already done for DDP:

def _sync_all_reduce(self, tensor: Union[torch.Tensor, numbers.Number]) -> Union[torch.Tensor, numbers.Number]:
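For context, the existing DDP path roughly does the following (a simplified sketch, not Ignite's exact implementation): numbers are wrapped into tensors on the metric's device and summed across processes with torch.distributed.all_reduce.

# Simplified sketch of the DDP-style reduction (not Ignite's exact code)
import numbers

import torch
import torch.distributed as dist


def _ddp_sync_all_reduce(self, tensor):
    if not (dist.is_available() and dist.is_initialized()):
        # nothing to reduce outside a distributed run
        return tensor

    tensor_to_number = False
    if isinstance(tensor, numbers.Number):
        tensor = torch.tensor(tensor, device=self._device, dtype=torch.float)
        tensor_to_number = True

    # in-place sum across all DDP processes
    dist.all_reduce(tensor)

    return tensor.item() if tensor_to_number else tensor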

To check whether we are running on a distributed TPU setup, we can try importing torch_xla at module level:

# global definition
try:
    import torch_xla.core.xla_model as xm
    on_xla_device = True
except ImportError:
    on_xla_device = False

and check whether we actually need to reduce with:

xm.xrt_world_size() > 1
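
Putting the two checks together, a minimal helper could look like this (a sketch; the helper name is not part of Ignite):

# Sketch: combine the availability check with the world-size check
try:
    import torch_xla.core.xla_model as xm
    on_xla_device = True
except ImportError:
    on_xla_device = False


def _xla_reduction_needed() -> bool:
    # reduce only when torch_xla is importable and more than one process is running
    return on_xla_device and xm.xrt_world_size() > 1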

This issue depends on #963.


vfdev-5 commented Apr 23, 2020

The following monkeypatching works to reduce metrics:

# We need to monkeypatch the base Ignite Metric class to work on distributed TPU
# until full TPU support lands on the PyTorch-Ignite side:
# https://github.com/pytorch/ignite/issues/965
import numbers

import torch
import torch_xla.core.xla_model as xm

from ignite.metrics import Metric


def _tpu_sync_all_reduce(self, tensor):
    tensor_to_number = False
    if isinstance(tensor, numbers.Number):
        # wrap plain numbers into a float tensor on the metric's device
        tensor = torch.tensor(tensor, device=self._device, dtype=torch.float)
        tensor_to_number = True

    if isinstance(tensor, torch.Tensor):
        # make sure the tensor lives on the specified device
        if tensor.device != self._device:
            tensor = tensor.to(self._device)
    else:
        raise TypeError("Unhandled input type {}".format(type(tensor)))

    # synchronize and reduce: in-place sum across all TPU processes
    xm.all_reduce("sum", [tensor, ])

    if tensor_to_number:
        return tensor.item()
    return tensor


# replace the default (DDP-based) reduction with the XLA-based one
Metric._sync_all_reduce = _tpu_sync_all_reduce

https://colab.research.google.com/drive/1Gy8bblDyXYBqMI7PuLAcIDKzRZGSoeYl
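
A minimal usage sketch after applying the patch (illustrative only; it assumes an XLA device is available and that the metric's compute() goes through Ignite's sync_all_reduce decorator, as Accuracy does):

# Usage sketch: create a metric on the XLA device after applying the monkeypatch
import torch
import torch_xla.core.xla_model as xm
from ignite.metrics import Accuracy

device = xm.xla_device()
accuracy = Accuracy(device=device)

accuracy.reset()
# placeholder predictions and targets standing in for real model outputs
y_pred = torch.tensor([[0.1, 0.9], [0.8, 0.2]], device=device)
y = torch.tensor([1, 0], device=device)
accuracy.update((y_pred, y))

# compute() now goes through the patched _sync_all_reduce, so the internal
# counters are summed across all TPU processes before the final division
print(accuracy.compute())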


vfdev-5 commented Apr 25, 2020

Follow-up about dtype support: pytorch/xla#1952
