The following monkeypatching works to reduce metrics:
```python
# We need to monkeypatch the base Ignite Metric class to work on distributed TPU
# until full TPU support on the PyTorch-Ignite side:
# https://github.com/pytorch/ignite/issues/965
import numbers

import torch
import torch_xla.core.xla_model as xm


def _tpu_sync_all_reduce(self, tensor):
    tensor_to_number = False
    if isinstance(tensor, numbers.Number):
        tensor = torch.tensor(tensor, device=self._device, dtype=torch.float)
        tensor_to_number = True

    if isinstance(tensor, torch.Tensor):
        # check if the tensor is on the specified device
        if tensor.device != self._device:
            tensor = tensor.to(self._device)
    else:
        raise TypeError("Unhandled input type {}".format(type(tensor)))

    # synchronize and reduce (in-place sum across all TPU replicas)
    xm.all_reduce("sum", [tensor, ])

    if tensor_to_number:
        return tensor.item()
    return tensor


from ignite.metrics import Metric

Metric._sync_all_reduce = _tpu_sync_all_reduce
```
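As a usage sketch (not part of the original comment): the patch is applied at import time, so metrics created inside a torch_xla multiprocessing worker pick it up automatically. `Accuracy`, the worker function name, and `nprocs=8` are placeholders.

```python
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

from ignite.metrics import Accuracy


def _mp_fn(index):
    device = xm.xla_device()
    # Metric state lives on the XLA device; with the patch applied,
    # compute() sums that state across all TPU replicas.
    accuracy = Accuracy(device=device)
    # ... attach `accuracy` to an evaluator engine and run evaluation ...


if __name__ == "__main__":
    xmp.spawn(_mp_fn, nprocs=8)
```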
🚀 Feature
Ignite will support distributed training on TPU (e.g. #960). Currently, metrics computation is affected in the same way as with DDP on GPUs: internal values are not reduced across processes.
The idea is to improve metrics computation and reduce internal values across processes, as is already done for DDP:
ignite/ignite/metrics/metric.py, line 92 in 0b36397
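For context, that DDP path reduces a metric's internal tensor with torch.distributed; a simplified sketch (not the exact code behind the permalink):

```python
import torch.distributed as dist


def _sync_all_reduce(self, tensor):
    # Simplified: in-place sum of the metric's internal tensor
    # across all DDP processes, only when a process group is set up.
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(tensor)
    return tensor
```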
To check whether we are running on distributed TPU, we can opt to do something like the following:
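A hedged sketch, assuming detection via torch_xla availability and the XLA world size; the helper name `_is_distributed_tpu` is hypothetical.

```python
def _is_distributed_tpu() -> bool:
    # Hypothetical helper: we are on distributed TPU if torch_xla is
    # importable and more than one XLA replica is participating.
    try:
        import torch_xla.core.xla_model as xm
    except ImportError:
        return False
    return xm.xrt_world_size() > 1
```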
and if we need to reduce:
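For instance, using the same XLA collective that the monkeypatch above relies on (here `tensor` stands for a metric's internal value):

```python
# In-place sum-reduce of the metric's internal tensor across TPU replicas.
xm.all_reduce("sum", [tensor, ])
```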
This issue depends on #963.