
[BC-breaking] Make Metrics accumulate values on device specified by user (#1232) #1238

Merged — 17 commits merged into master from metrics_impl on Sep 11, 2020

Conversation

vfdev-5
Collaborator

@vfdev-5 vfdev-5 commented Aug 16, 2020

  • update accuracy to accumulate _num_correct in a tensor on the right device

  • update loss metric to accumulate _sum in a tensor on the right device

  • update mae metric to accumulate in a tensor on the right device

  • update mpd metric to accumulate in a tensor on the right device

  • update mse metric to accumulate in a tensor on the right device

  • update top k accuracy metric to accumulate in a tensor on the right device

  • update precision and recall metrics to accumulate in tensors on the right device

  • black formatting

  • reverted run*.sh

  • change all metrics default device to cpu except running_average

  • Update ignite/metrics/precision.py

  • remove Optional type from metric devices since default is cpu

  • add comment explaining lack of detach in accuracy metrics

Fixes #1082

Original PR #1232

Description:

  • Merge PR to master

Check list:

  • New tests are added (if a new feature is added)
  • New doc strings: description and/or example code are in RST format
  • Documentation is updated (if required)

cc @n2cholas
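
For illustration, the accumulation pattern described above boils down to keeping the running counters as tensors on the metric's device instead of Python numbers. A minimal sketch (the class and attribute names mirror Accuracy-style metrics, but this is not the exact ignite code):

import torch

class SketchAccuracy:
    """Toy accuracy metric that accumulates on a user-specified device (illustrative only)."""

    def __init__(self, device="cpu"):
        self._device = torch.device(device)
        self.reset()

    def reset(self):
        # Keep the correct-count as a tensor on the metric's device, not a Python int.
        self._num_correct = torch.tensor(0, device=self._device)
        self._num_examples = 0

    def update(self, output):
        y_pred, y = output[0].detach(), output[1].detach()
        correct = torch.eq(y_pred.argmax(dim=1), y).view(-1)
        # Move the partial sum to the accumulator's device before adding it.
        self._num_correct += torch.sum(correct).to(self._device)
        self._num_examples += correct.shape[0]

    def compute(self):
        return self._num_correct.item() / self._num_examples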

@vfdev-5
Collaborator Author

vfdev-5 commented Aug 16, 2020

@n2cholas now I think we have to update the docstrings and the main metrics docs to explain how to use the device argument, etc.
Please send another PR to this branch from your fork so that we can update everything here and finally merge to master once we are done.

EDIT: I also have to check the code on real TPUs and run some local tests...

@vfdev-5
Collaborator Author

vfdev-5 commented Aug 16, 2020

Seems like some tests on Horovod are failing: https://app.circleci.com/pipelines/github/pytorch/ignite/618/workflows/c74dff3a-5840-418e-837f-c43c4ac4b646/jobs/1308

I can inspect it later.

@n2cholas
Contributor

@vfdev-5 The new tests I wrote checking that the accumulators are on the right device are failing. It's very strange that the other two_gpu tests pass and only Horovod fails. Is it possible to run Horovod on CPU only (like it was for the TPUs)? If so, I can take a stab at it.

@vfdev-5
Collaborator Author

vfdev-5 commented Aug 16, 2020

We compile Horovod from source on the CI: https://github.com/pytorch/ignite/blob/master/.circleci/config.yml#L234
You can either simply install it from pip (which will probably give you only the basic CPU version), or compile it as in our CI but without NCCL support. To run Horovod on CPU only, you can use CUDA_VISIBLE_DEVICES="" pytest ....

@n2cholas
Contributor

@vfdev-5 I was able to set up Horovod and run the distributed tests on the CPU, but could not reproduce the error from CI (the tests pass for me). Would you mind taking a look when you have a chance?

@vfdev-5
Collaborator Author

vfdev-5 commented Aug 17, 2020

@n2cholas thanks for the info ! Yes, I'll try to take a look as soon as possible.

@vfdev-5
Collaborator Author

vfdev-5 commented Aug 17, 2020

@n2cholas just checked one test pytest tests/ignite/metrics/test_accuracy.py::test_distrib_hvd on 1 GPU with Horovod installed with NCCL support and I could reproduce the error:

pytest tests/ignite/metrics/test_accuracy.py::test_distrib_hvd

And I confirm that running the test on CPU passes.

More precisely, the problem is:

Mon Aug 17 08:15:55 2020[0]<stderr>:    assert acc._num_correct.device == device, "{} vs {}".format(acc._num_correct.device, device)                                        
Mon Aug 17 08:15:55 2020[0]<stderr>:AssertionError: cuda:0 vs cuda 
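
For context, torch.device("cuda") and torch.device("cuda:0") do not compare equal, which is exactly what this assertion trips over. A quick illustration (runnable even without a GPU):

import torch

d1 = torch.device("cuda")    # device type only, no index
d2 = torch.device("cuda:0")  # explicit index 0

print(d1 == d2)            # False: equality compares type AND index
print(d1.type == d2.type)  # True: comparing just the device type ignores the index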

@vfdev-5
Collaborator Author

vfdev-5 commented Aug 17, 2020

@n2cholas I checked the accuracy tests and proposed a way to improve them and also fix the issue:
b8a4355
Could you please adapt all the other tests in the same way? Thanks!

EDIT: shouldn't we also update the RunningAverage metric, at least to handle device properly?

@n2cholas
Contributor

n2cholas commented Aug 17, 2020

@vfdev-5 Thanks for the improved tests, I'll update the rest soon.

For the RunningAverage metric, I updated the metric to use src's device when src is a Metric, and otherwise use the device parameter. However, this will only accumulate on the desired device when the input is a tensor. You will see what I mean if you take a look at 90b5b85. Please let me know if you think the behaviour should be different.

Since I'm committing to my metrics_impl branch, this PR and the docs PR seem to be getting mixed up. Should I be using two different branches? I was not able to commit directly to this branch (ignite/metrics_impl)

@vfdev-5
Collaborator Author

vfdev-5 commented Aug 17, 2020

For the RunningAverage metric, I updated the metric to use src's device when src is a Metric, and otherwise use the device parameter. However, this will only accumulate on the desired device when the input is a tensor. You will see what I mean if you take a look at 90b5b85. Please let me know if you think the behaviour should be different.

OK, I understand. Maybe we should check RunningAverage.src.device against the device argument and raise an error if they differ, or else override RunningAverage.src.device... What would be more helpful for the user?

Since I'm committing to my metrics_impl branch, this PR and the docs PR seem to be getting mixed up. Should I be using two different branches?

I will merge #1239 once we have addressed all the visible issues, and it will land in the pytorch/metrics_impl branch. I know this is not an ideal workflow, but for the moment it is an economical compromise between running GPU tests on all PRs and running them only on the project's branches...

@n2cholas
Contributor

OK, I understand. Maybe we should check RunningAverage.src.device against the device argument and raise an error if they differ, or else override RunningAverage.src.device... What would be more helpful for the user?

That makes sense to me. So, by default it will be None, and in this case we will use RunningAverage.src._device. If the user passes an actual device in, we check that it matches src's device. I'll fix this and write the tests.

I will merge #1239 once we have addressed all the visible issues, and it will land in the pytorch/metrics_impl branch. I know this is not an ideal workflow, but for the moment it is an economical compromise between running GPU tests on all PRs and running them only on the project's branches...

Got it, thanks!

@vfdev-5
Collaborator Author

vfdev-5 commented Aug 17, 2020

That makes sense to me. So, by default it will be None, and in this case we will use RunningAverage.src._device. If the user passes an actual device in, we check that it matches src's device. I'll fix this and write the tests.

Maybe None won't work if we use an output instead of a metric as the source... It would probably be better to keep the same init as for the other metrics.

@n2cholas
Contributor

Maybe None won't work if we use an output instead of a metric as the source... It would probably be better to keep the same init as for the other metrics.

In the current logic, the init has None as the default, then if an output is used instead of a metric and the user doesn't specify a device, we use torch.device("cpu").

My only concern with using CPU as the default argument is if the user passes in a metric on the GPU, they will have to specify the same device as an argument to RunningAverage, otherwise we throw an error. With the current implementation, we can check if device is None and then use the same device as the specified metric automatically. Then, we only throw an error when the user explicitly passes in the wrong device for running average.

Also, I would like to clarify one thing: we only use the specified device for the accumulator when the output/metric.compute produces a tensor (since when it is just a python scalar, we don't want to re-wrap it in a tensor). Is this behaviour acceptable? If so, I already made a note for this in the docstring. Should I clarify this anywhere else?
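
Roughly, the device-resolution logic being discussed could be sketched as follows (an illustrative helper, not the actual RunningAverage code; src is assumed to expose a _device attribute when it is a Metric):

import torch

def resolve_running_average_device(src=None, device=None):
    # Sketch only: mirrors the rules discussed above, not the actual implementation.
    if src is not None:
        # src is a Metric: reuse its device unless the user passed a matching one.
        if device is None:
            return src._device
        if torch.device(device) != src._device:
            raise ValueError("device must match the device of the source metric")
        return torch.device(device)
    # Output-based RunningAverage: default to CPU when no device is given.
    return torch.device(device) if device is not None else torch.device("cpu")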

@vfdev-5
Collaborator Author

vfdev-5 commented Aug 17, 2020

In the current logic, the init has None as the default, then if an output is used instead of a metric and the user doesn't specify a device, we use torch.device("cpu").
My only concern with using CPU as the default argument is if the user passes in a metric on the GPU, they will have to specify the same device as an argument to RunningAverage, otherwise we throw an error. With the current implementation, we can check if device is None and then use the same device as the specified metric automatically. Then, we only throw an error when the user explicitly passes in the wrong device for running average.

Yes, you are right. Sounds good to me.

Also, I would like to clarify one thing: we only use the specified device for the accumulator when the output/metric.compute produces a tensor (since when it is just a python scalar, we don't want to re-wrap it in a tensor). Is this behaviour acceptable? If so, I already made a note for this in the docstring. Should I clarify this anywhere else?

Do we have cases where the updated output produces a scalar and not a tensor? Maybe I didn't understand your point about "output/metric.compute produces a tensor"...
I agree that it does not make much sense to wrap counters like self._num_examples etc.

Should I clarify this anywhere else?

If this could help the user better understand what happens under the hood, I'd say yes. Well, let me see the complete doc and we can decide later if we need to add anything specific...

@n2cholas
Contributor

Do we have cases where the updated output produces a scalar and not a tensor? Maybe I didn't understand your point about "output/metric.compute produces a tensor"...
I agree that it does not make much sense to wrap counters like self._num_examples etc.

I believe most metrics' compute methods produce a float instead of a tensor. For example, MeanSquaredError.compute() returns a float. So, if we use RunningAverage(MeanSquaredError()), every time we call compute we have the following logic:

    def compute(self) -> Union[torch.Tensor, float]:
        if self._value is None:
            self._value = self._get_src_value()
        else:
            self._value = self._value * self.alpha + (1.0 - self.alpha) * self._get_src_value()

        return self._value

We call MSE's compute via RunningAverage._get_src_value(), which returns a float, so self._value will be a float instead of a tensor on the device the user specified.

I think this is fine, but I want to make sure it's what we want.

@vfdev-5
Collaborator Author

vfdev-5 commented Aug 17, 2020

Ah OK, I understand now. Thanks for the explanation. Yes, I think this is fine too.

sdesrozis and others added 2 commits August 18, 2020 09:13

* update docstrings and docs

* Update ignite/metrics/accumulation.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/accumulation.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/accumulation.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/accuracy.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/fbeta.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/loss.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/metric.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/precision.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/recall.py

Co-authored-by: vfdev <[email protected]>

* add comment explaining lack of detach in metrics docs

* support device argument for running_average

* update support for device argumenet for accumulation

* fix and improve device tests for metrics

* fix and improve device tests for metrics

* fix TPU tests

* Apply suggestions from code review

* Apply suggestions from code review

Co-authored-by: vfdev <[email protected]>
@vfdev-5
Collaborator Author

vfdev-5 commented Aug 18, 2020

@n2cholas if you could add distributed tests that reproduce the problem from #1242, it would be awesome (at least for problems 1 and 3, since you say they can be fixed).

@vfdev-5
Collaborator Author

vfdev-5 commented Aug 18, 2020

@fco-dv @sdesrozis please start to review

@n2cholas
Contributor

Looking at the colab, this corresponds to single-TPU execution and not CPU. If NUM_TPU_WORKERS=8 is set we run on 8 devices in distributed mode, otherwise it picks a single TPU and runs on it.

Thanks for catching this!

I propose to create a script to measure execution times, a sort of integration test with Engine and metrics:

  • in a single test, let's generate random y_pred, y to compute metrics on for 100 iterations, 3 epochs.

  • reimplement metrics like Accuracy and Accumulation to keep current master implementation (let's call it "as number" implementation vs this PR's "as tensor" implementation)

  • run benchmarks per metric on both implementations

    • let's run each test 3-5 times for better time stats

    • in various configurations:

      • only on CPU, single device, no distributed
      • only on GPU, single device, no distributed
      • only on GPUs, two devices, distributed
      • only on TPU, single device, no distributed
      • only on TPUs, 8 devices, distributed

What do you think ?

This sounds good. I'll work on this soon.
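
A minimal sketch of what such a timing script could look like (illustrative only; the actual benchmark code ended up in the gist linked further down in the thread):

import time
import torch
from ignite.metrics import Accuracy

def benchmark_metric(metric, device, n_iters=100, n_epochs=3, repeats=5):
    # Time reset/update/compute over random data; return the mean wall time per run.
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(n_epochs):
            metric.reset()
            for _ in range(n_iters):
                y_pred = torch.rand(64, 10, device=device)
                y = torch.randint(0, 10, (64,), device=device)
                metric.update((y_pred, y))
            metric.compute()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)

if __name__ == "__main__":
    dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(benchmark_metric(Accuracy(device=dev), dev))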

I wonder what would be better to do:

  • detach tensors in output of update(output) inside iteration_completed
    or
  • explicitly in each update function.

In the first case, since we do not restrict the API, the user can use either the attach method or the update/compute methods. If update/compute is used, it is up to the user to detach things before calling update, or to use the torch.no_grad() context manager. Is that acceptable as a requirement?

I would prefer to detach explicitly in each update function so the user doesn't have to think about it, since in the current master users don't need to worry about detaching. But I'm not too familiar with how the metrics are used in the engine (I use ignite solely for the metrics), so I will defer to your judgement on this decision.

I addressed the comments in my branch (n2cholas/ignite:metrics_impl), and am not sure how to push those commits into this branch. Should I open a separate pull request...?

@vfdev-5
Collaborator Author

vfdev-5 commented Aug 31, 2020

I would prefer to detach explicitly in each update function so the user doesn't have to think about it, since in the current master users don't need to worry about detaching. But I'm not too familiar with how the metrics are used in the engine (I use ignite solely for the metrics), so I will defer to your judgement on this decision.

The metrics reset-update-compute API can be used as follows:

acc = Accuracy()

acc.reset()

with torch.no_grad():
    acc.update((y_pred1, y1))

with torch.no_grad():
    acc.update((y_pred2, y2))

print(acc.compute())

In this case, detach is not needed in the update. Otherwise, we have to ensure everywhere that we detach y_pred and y. It is just two additional lines of code though, so not a big deal.

But I'm not too familiar with how the metrics are used in the engine (I use ignite solely for the metrics)

@n2cholas try it with the Engine :)

I addressed the comments in my branch (n2cholas/ignite:metrics_impl), and am not sure how to push those commits into this branch. Should I open a separate pull request...?

Well, technically, yes, you have to send PRs to this branch. But if you plan to do more work this week, I can exceptionally give you temporary write permission to push directly to this branch.

@n2cholas
Contributor

n2cholas commented Sep 1, 2020

The metrics reset-update-compute API can be used as follows:

acc = Accuracy()

acc.reset()

with torch.no_grad():
    acc.update((y_pred1, y1))

with torch.no_grad():
    acc.update((y_pred2, y2))

print(acc.compute())

In this case, detach is not needed in the update. Otherwise, we have to ensure everywhere that we detach y_pred and y. It is just two additional lines of code though, so not a big deal.

I see! What about changing the update method so it has the torch.no_grad() decorator, as follows:

@torch.no_grad()
@reinit__is_reduced
def update(self, output):
    ...

I'm just wondering if there's a good reason to push this overhead to the user. Is there any case where the user would not want the tensors to be detached?
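
A self-contained sketch of that decorator idea (the ToyMetric class is hypothetical, just to show that nothing in the accumulator ends up requiring grad):

import torch

class ToyMetric:
    def __init__(self):
        self._num_correct = torch.tensor(0.0)

    @torch.no_grad()  # all tensor ops inside update run without building a graph
    def update(self, output):
        y_pred, y = output
        self._num_correct += (y_pred.argmax(dim=1) == y).sum()

m = ToyMetric()
y_pred = torch.randn(8, 3, requires_grad=True)
m.update((y_pred, torch.randint(0, 3, (8,))))
print(m._num_correct.requires_grad)  # False: the accumulator carries no graph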

@n2cholas try it with the Engine :)

Haha will do on my next PyTorch project!

Well, technically, yes, you have to send PRs to this branch. But if you plan to do more work this week, I can exceptionally give you temporary write permission to push directly to this branch.

No worries, I was just curious. I will work on the benchmarks then create a PR as usual.

@n2cholas
Contributor

n2cholas commented Sep 6, 2020

I've run the integration benchmarks on CPU, GPU, and multi-GPU. Here is a gist with the code and here are the numbers I got (single-GPU tests are on a Titan V, multi-GPU on a Titan V + Titan RTX). The error represents the margin of error of a 95% confidence interval (you can see how it's generated in the gist). Overall, the results are what we expect: the new implementation is slightly slower for CPU workloads, but slightly faster for GPU and distributed workloads.

I was trying to see what was going on with XLA, and I think for now we should not allow users to keep metrics on an XLA device, and instead they should use the metrics on the normal CPU when training on the TPU. We can throw an error if they try to set the metric's device to an XLA device. I'm not sure how to fix the performance issues we were seeing. Does this sound good @vfdev-5?

@vfdev-5
Collaborator Author

vfdev-5 commented Sep 6, 2020

@n2cholas thanks a lot for the benchmark! It is very helpful to see the complete picture!

I was trying to see what was going on with XLA, and I think for now we should not allow users to keep metrics on an XLA device, and instead they should use the metrics on the normal CPU when training on the TPU. We can throw an error if they try to set the metric's device to an XLA device. I'm not sure how to fix the performance issues we were seeing. Does this sound good @vfdev-5?

I agree that we can use CPU instead of XLA for now, until we find out where the problem is and whether it can be solved somehow...
Probably the issue is that the XLA compiler has to create a graph for all those updates, and that takes time...
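
What the XLA guard could look like, sketched as a check in the metric base class (per the "ensure metrics aren't created on XLA devices" / "move xla check to Metric.__init__" commits listed below; the error message and details are illustrative):

import torch

class MetricSketch:
    # Sketch of the base-class check, not the actual ignite Metric code.
    def __init__(self, device="cpu"):
        device = torch.device(device)
        # Accumulating on XLA was slow in the benchmarks above, so refuse XLA devices
        # and keep the accumulators on CPU when training on TPU instead.
        if device.type == "xla":
            raise ValueError("Cannot create metric on an XLA device. Use device='cpu' instead.")
        self._device = device

try:
    MetricSketch(device="xla")
except ValueError as e:
    print(e)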

@n2cholas n2cholas mentioned this pull request Sep 7, 2020
n2cholas and others added 4 commits September 9, 2020 22:14
* detach tensors earlier in update

* remove redundant to() call

* ensure metrics aren't created on XLA devices

* Fixed isort

* move xla check to Metric.__init__ instead of individual metrics

* update xla tests

* replace deleted callable check

* remove redundant precision and recall __init__

* replace precision/recall __init__ for docs rendering

* add support for metrics_lambda with components on diff devices

* fix epoch_metric xla test

Co-authored-by: vfdev <[email protected]>
Co-authored-by: n2cholas <[email protected]>
Collaborator Author

@vfdev-5 vfdev-5 left a comment

@n2cholas I left a few comments. I still have the question about explicit detach calls. Now we have some classes where we call it directly on the output and others where we do not call it at all... I think it would be better to handle everything the same way: call detach on the output. What do you think?

EDIT: There are some failing Horovod tests on GPU to fix: https://app.circleci.com/jobs/github/pytorch/ignite/1477

@@ -67,6 +72,7 @@ def update(self, output: Union[Any, torch.Tensor, numbers.Number]) -> None:
output = output.to(self._device)

self.accumulator = self._op(self.accumulator, output)

Collaborator Author

Shouldn't we detach output if it is a tensor here? Something like:

if self._device is not None:
    # Put output to the metric's device
    if isinstance(output, torch.Tensor):
        output = output.detach()
        if (output.device != self._device):
            output = output.to(self._device)

self.accumulator = self._op(self.accumulator, output)

Contributor

Yes, we should, thanks for catching this.

@@ -161,11 +163,12 @@ def update(self, output: Sequence[torch.Tensor]) -> None:
y = torch.transpose(y, 1, last_dim - 1).reshape(-1, num_classes)
correct = torch.all(y == y_pred.type_as(y), dim=-1)

self._num_correct += torch.sum(correct).item()
# Don't need to detach here because torch.eq is not differentiable, so the computation graph is detached anyway.
Collaborator Author

Let's remove this comment and call detach on the output as in other classes:

# y_pred, y = output
y_pred, y = output[0].detach(), output[1].detach()

Contributor

Sounds good!

@n2cholas
Contributor

@vfdev-5 I agree, I'll ensure all the metrics consistently call detach on the outputs (and fix those horovod tests too).

n2cholas and others added 2 commits September 11, 2020 17:14
* detach output consistently for all metrics

* fix horovod two gpu tests

* make confusion matrix detaches like other metrics

Co-authored-by: vfdev <[email protected]>
Co-authored-by: n2cholas <[email protected]>
Collaborator Author

@vfdev-5 vfdev-5 left a comment

@n2cholas thanks a lot for working on this huge PR!
I'll fix the failing TPU CI and then we can merge it.

@vfdev-5 vfdev-5 merged commit 002b595 into master Sep 11, 2020
@vfdev-5 vfdev-5 deleted the metrics_impl branch September 11, 2020 23:54
@vfdev-5
Collaborator Author

vfdev-5 commented Sep 16, 2020

@n2cholas we are preparing the v0.4.2 release, and since this is a BC-breaking change I think this commit will only be included starting from v0.5.0.

vfdev-5 added a commit that referenced this pull request Sep 16, 2020
@n2cholas
Contributor

@vfdev-5 Sounds good, thank you!

vfdev-5 added a commit that referenced this pull request Sep 19, 2020
@vfdev-5 vfdev-5 mentioned this pull request Sep 22, 2020
vfdev-5 added a commit to vfdev-5/ignite that referenced this pull request Oct 4, 2020
* Updated ImageNet example (pytorch#1138)

* [WIP] Updated ImageNet example
- minor fixes for Pascal VOC12

* Fixed flake8

* Updated pytorch-version-tests.yml to run cron every day at 00:00 UTC (pytorch#1141)

Co-authored-by: Sylvain Desroziers <[email protected]>

* Added check_compute_fn argument to EpochMetric and related metrics (pytorch#1140)

* Added check_compute_fn argument to EpochMetric and related functions.

* Updated docstrings

* Added check_compute_fn to _BaseRegressionEpoch

* Adding typing hints for check_compute_fn

* Update roc_auc.py

Co-authored-by: Sylvain Desroziers <[email protected]>
Co-authored-by: vfdev <[email protected]>

* Docs cosmetics (pytorch#1142)

* Updated docs, replaced single quote by double quote if is code
- fixed missing link to Engine
- cosmetics

* More doc updates

* More updates

* Fix batch size calculation error (pytorch#1137)

* Fix batch size calculation error

* Add tests for fixed batch size calculation

* Fix tests

* Test for num_workers

* Fix nproc comparison

* Improve docs

* Fixed docstring

Co-authored-by: vfdev <[email protected]>

* Docs updates (pytorch#1139)

* [WIP] Added teaser gif

* [WIP] Updated README

* [WIP] Updated README

* [WIP] Updated docs

* Reverted unintended pyproject.toml edits

* Updated README and examples parts

* More updates of README

* Added badge to check pytorch/python compatible versions

* Updated README

* Added ref to blog "Using Optuna to Optimize PyTorch Ignite Hyperparameters"

* Update README.md

* Fixed bad internal link in examples

* Updated README

* Fixes docs (pytorch#1147)

* Fixed bad link on teaser

* Added manual_seed into docs

* Issue pytorch#1115 : pbar persists due to specific rule in tqdm (notebook) when n < total (pytorch#1145)

* Issue pytorch#1115
pbar persists in notebook due to specific rules when n < total

* close pbar doesn't rise danger bar

* fix when pbar.total is None

Co-authored-by: vfdev <[email protected]>
Co-authored-by: Desroziers <[email protected]>

* Updated codebase such that torch>=1.3 (pytorch#1150)

Co-authored-by: vfdev <[email protected]>

* add wandb (pytorch#1152)

wandb integration already exists, just adding it to the requirements file

* Fixed typo and missing part of "Where to go next" (pytorch#1151)

* Fixes pytorch#1153 (pytorch#1154)

- temporary downgrade of scipy to 1.4.1 instead of 1.5.0

* Use global_step as priority, if it exists (pytorch#1155)

* Use global_step as priority, if it exists

* Fix flake8 error

* Style fix

Co-authored-by: vfdev <[email protected]>

* Fix TrainsSaver handling of Checkpoint's n_saved (pytorch#1135)

* Utilize Trains framework callbacks to better support checkpoint saving and respect Checkpoint.n_saved

* Update trains callbacks to new format

* autopep8 fix

* Fix trains mnist example (store checkpoints in local folder)

* Use trains 0.15.1rc0 until PR is approved

* Use CallbackType for Trains callback type resolution.
Add unit test for Trains callbacks

* Update trains version

* Updated test_trains_saver_callbacks

Co-authored-by: jkhenning <>
Co-authored-by: vfdev <[email protected]>

* Stateful handlers (pytorch#1156)

* Stateful handlers

* Added state_dict/load_state_dict tests for Checkpoint

* integration test

* Updated docstring and added include_self to ModelCheckpoint

* An integreation test for checkpointing with stateful handlers

* Black and flake8

Co-authored-by: vfdev-5 <[email protected]>

* Fixes pytorch#1162 (pytorch#1163)

* Fixes pytorch#1162
- relaxed check of optimizer type

* Updated docs

* Cosmetics (pytorch#1164)

* update ignite version to 0.5.0 in preparation of next release. (pytorch#1158)

Co-authored-by: vfdev <[email protected]>

* Create FUNDING.yml

* Update README.md

Added "Uncertainty Estimation Using a Single Deep Deterministic Neural Network" paper by @y0ast

* Issue 1124 (pytorch#1170)

* Fixes pytorch#1124

- Trains logger can log torch vectors

* Log vector as title=tag+key, series=str(index)

* Improved namings in _XlaDistModel (pytorch#1173)

* Issue 1123 - Improve usage of contrib common methods with other save handlers  (pytorch#1171)

* Added delegated_save_best_models_by_val_score

* Fixes pytorch#1123
- added save_handler arg to setup_common_training_handlers
- added method delegated_save_best_models_by_val_score

* Renamed delegated_save_best_models_by_val_score to gen_save_best_models_by_val_score

* Issue 1165 : nccl + torch.cuda not available (pytorch#1166)

* fix issue 1165

* Update ignite/distributed/comp_models/native.py

Co-authored-by: vfdev <[email protected]>

* add test for nccl /wo gpu

Co-authored-by: Desroziers <[email protected]>
Co-authored-by: vfdev <[email protected]>

* Fix typo in the docstring of ModelCheckpoint

* Fixes failing tests with native dist comp model (pytorch#1177)

- saves/restore env on init/finalize

* Set isort to 4.3.21 as it fails on 5.0 (pytorch#1180)

* improve docs for custom events (pytorch#1179)

* ValueError -> TypeError (pytorch#1175)

* ValueError -> TypeError

* refactor corresponeding unit-test

Co-authored-by: vfdev <[email protected]>

* Update cifar10  (pytorch#1181)

* Updated code to log models on Trains server

* Updated cifar10 example to log necessary things to Trains

* Fix Exception misuse in `ignite.contrib.handlers.base_logger.py` (pytorch#1183)

* ValueError -> TypeError

* NotImplementedError -> NotImplemented

* rollback ignite/engine/events [raise NotImplementedError]

* fix misuses of exceptions in ignite/contrib/handlers/base_logger.py

* refactor corresponding unit tests

Co-authored-by: Sylvain Desroziers <[email protected]>

* Fixed failing cifar10 test (pytorch#1184)

* Fix Exception misuse in `ignite.contrib.handlers.custom_events.py` (pytorch#1186)

* ValueError -> TypeError

* NotImplementedError -> NotImplemented

* rollback ignite/engine/events [raise NotImplementedError]

* fix misuses of exceptions in ignite/contrib/handlers/custom_events.py

* remove period in exceptions

* refactor corresponding unit tests

* Update tpu-tests.yml

* Fix Exception misuse in `ignite.contrib.engines.common.py` (pytorch#1182)

* ValueError -> TypeError

* NotImplementedError -> NotImplemented

* fix misuses of exceptions in ignite/contrib/engines/common.py

* rollback ignite/engine/events [raise NotImplementedError]

Co-authored-by: Sylvain Desroziers <[email protected]>
Co-authored-by: vfdev <[email protected]>

* Refactored test_utils.py into 3 files (pytorch#1185)

- we can better test new coming comp models

Co-authored-by: Sylvain Desroziers <[email protected]>

* Fix Exception misuse in `ignite.contrib.handlers.lr_finder.py` (pytorch#1187)

* ValueError -> TypeError

* NotImplementedError -> NotImplemented

* rollback ignite/engine/events [raise NotImplementedError]

* fix misuses of exceptions in ignite/contrib/handlers/lr_finder.py

* refactor corresponding unit tests

* fix typo

Co-authored-by: Desroziers <[email protected]>
Co-authored-by: Sylvain Desroziers <[email protected]>
Co-authored-by: vfdev <[email protected]>

* Fix Exception misuse in `ignite.contrib.handlers.mlflow_logger.py` (pytorch#1188)

* ValueError -> TypeError

* NotImplementedError -> NotImplemented

* rollback ignite/engine/events [raise NotImplementedError]

* fix misuses of exceptions in ignite/contrib/handlers/mlflow_logger.py & refactor corresponding unit tests

Co-authored-by: Sylvain Desroziers <[email protected]>
Co-authored-by: vfdev <[email protected]>

* Fix Exception misuse in `ignite.contrib.handlers.neptune_logger.py` (pytorch#1189)

* ValueError -> TypeError

* NotImplementedError -> NotImplemented

* rollback ignite/engine/events [raise NotImplementedError]

* fix misuses of exceptions in ignite/contrib/handlers/neptune_logger.py & refactor corresponding unit tests

Co-authored-by: Sylvain Desroziers <[email protected]>
Co-authored-by: vfdev <[email protected]>

* Update README.md (pytorch#1190)

* Update README.md

We are adding a disclaimer to all non-FB led repos in the PyTorch github org. Let me know if you have any concerns. Thanks!

* Update README.md

Co-authored-by: vfdev <[email protected]>

* fix for distributed proxy sampler runtime error (pytorch#1192)

* fix for distributed proxy sampler padding

* fixed formatting

* Updated timers to include fired hanlders' times (pytorch#1104) (pytorch#1194)

* update timers including fired handlers ones

* autopep8 fix

* fix measurement and add test

* rename fire_start_time to handlers_start_time

Co-authored-by: Desroziers <[email protected]>
Co-authored-by: AutoPEP8 <>

* Improve pascalvoc (pytorch#1193)

* Fixes pytorch#1124

- Trains logger can log torch vectors

* [WIP] Fixes issue with exp_trackin
- improved configs
- training script

* [WIP] Added explicit TrainsSaver setup

* Updated training script

* Fixed formatting

* Fixed bad merging

* Added missing rank dispatch for the progressbar

* Custom filename pattern for saving checkpoints (pytorch#1127)

* Custom filename pattern for saving checkpoints

* The suffix check be confused when adding name initially to the dict

* The filename prefix was updated which is not necessary was reverted

* The default filename pattern attribute was set instead of the `_filename_pattern`

* The redundant filename pattern to make filename was ugly, changed to something much more simple.

* The filename pattern implementation changed to have a new way to be initialized via an additional argument.

* - The extension given in the class has a dot infront of it, this can cause issues when having the latest filename pattern. have fixed it by assigning only the extension value not the dot
- The docsstring was updated to latest changes
- The assignment of name to filename pattern was missing

* The tests for checking the checkpoint filenames when a custom filename pattern is given.

* The formatting issue fixed

* - Added a function to get the filename pattern for the default to make it much more readable.
- Updated the current checkpoint __call__ to make filename based on the new function which has introduced
- Updated test_checkpoint_filename_pattern to have the exact values instead have a function.
- Updated a test case where it was failing due to the latest changes in a checkpoint __call__.

* - The _get_filename_pattern function updated to public and static setup_filename_pattern
- The setup_filename_pattern now takes updated arguments of with_score, with_score_name and with_global_step_transform

* The dostring and the static setup_filename_pattern were updated

- The docstring was updated with the filename_pattern also added a example
  for this as well.
- The static function `setup_filename_pattern` to get the default filename pattern
  of a checkpoint didn't have a proper typing. Have updated accordingly
- The `setup_filename_pattern` function accepted the custom filename pattern
  which was not required. Have updated this as well not to accept the custom
  filename pattern.

* The tests for the static function `Checkpoint.setup_filename_pattern`.

* The Docstring for setup_filename_pattern added and have updated the tests for this function.
- The docstring for the function used for making the default filename pattern for checkpoints is added.
- Added a new argument for filename prefix (`with_prefix`).
- The tests for the update is added

* Code clean up to have much more meaning to the code

* Simplified the code and tests

* fix quotes

* Revert "fix quotes"

This reverts commit 1b8d8e1.

Co-authored-by: Sylvain Desroziers <[email protected]>
Co-authored-by: vfdev <[email protected]>

* Docs update and auto_model change (pytorch#1197)

* Fixes pytorch#1174
- Updated docs
- auto_model puts params on device if they are not the device

* - Updated docs

* Update auto.py

* Minor optimization for idist.get_* (pytorch#1196)

* Minor optimization for idist.get_*

* Set overhead threshold to 1.9

* Keep only test_idist_methods_overhead_nccl

* Removed _sync_model_wrapper to implicitly check if we need to sync model
This also reduces time of idist.get_* method calls vs native calls

* Update test_native.py

* autopep8 fix

* Update test_native.py

Co-authored-by: AutoPEP8 <>
Co-authored-by: Sylvain Desroziers <[email protected]>

* Propagate spawn kwargs from parallel to model's spawn (pytorch#1201)

* Fixes pytorch#1199
- Updated code to propagate spawn kwargs
- start_method is fork by default

* Fixed bad syntax

* Fixes pytorch#1198 - bug with CM in PascalVOC example (pytorch#1200)

* Fixes pytorch#1198
- put CM to cpu before converting to numpy
- removed manual recall computation, put into CM definition

* Explicit CM compute by all proc and logging by 0 rank proc

* Added link to Discuss.PyTorch forum (pytorch#1205)

- Updated readme and FAQ

* Fixed wrong IoU computation in Pascal VOC (pytorch#1204)

* Fixed wrong IoU computation

* use black to fix lint check error

* Updated training code:
- added custom_event_filter to log images less frequently
- split events to avoid running validation twice in the end of the training

* Fixed formatting

Co-authored-by: Desroziers <[email protected]>

* Fix Typo in `ignite.handlers.timing` (pytorch#1208)

* ValueError -> TypeError

* NotImplementedError -> NotImplemented

* rollback ignite/engine/events [raise NotImplementedError]

* fix misuses of exceptions in ignite/contrib/handlers/custom_events.py

* remove period in exceptions

* refactor corresponding unit tests

* fix typo in ignite/handlers/timing.py

* Fixes issue with logging XLA tensors (pytorch#1207)

* [WIP] fixed typing

* Fixes pytorch#1136

- fixed problem when all_reduce does not put result tensor to original device

* REFACTOR: Early Return Pattern (if elif else -> if if return) (pytorch#1211)

* Issue 1133 - Fixes flaky Visdom tests (pytorch#1149)

* [WIP] inspect bug

* Attempt to fix flaky Visdom tests

* autopep8 fix

Co-authored-by: vfdev-5 <[email protected]>
Co-authored-by: AutoPEP8 <>

* Updated about page

* Replaced teaser code by a notebook runnable in Colab (pytorch#1216)

* Replaced teaser code by a notebook runnable in Colab

* Updated teaser (py, ipynb)

* Added support of Horovod (pytorch#1195)

* [WIP] Horovod comp model

* [WIP] Horovod comp model
- Implemented spawn
- Added comp model tests

* Refactored test_utils.py into 3 files
- we can better test new coming comp models

* [WIP] Run horovod tests

* [WIP] Horovod comp model + tests

* autopep8 fix

* [WIP] More tests

* Updated utils tests

* autopep8 fix

* [WIP] more tests

* Updated tests and code and cifar10 example

* autopep8 fix

* Fixed failing CI and updated code

* autopep8 fix

* Fixes failing test

* Fixed bug with new/old hvd API and the config

* Added metric tests

* Formatting and docs updated

* Updated frequency test

* Fixed formatting and a typo in idist.model_name docs

* Fixed failing test

* Docs updates and updated auto methods according to horovod API

* autopep8 fix

* Cosmetics

Co-authored-by: AutoPEP8 <>

* metrics: add SSIM (pytorch#1217)

* metrics: add SSIM

* add scikit-image dependency

* add distributed tests, fix docstring

* .gitignore back to normal

* Update ignite/metrics/ssim.py

Co-authored-by: vfdev <[email protected]>

* .format(), separate functions

* scalar input for kernel, sigma, fix py3.5 CI

* apply suggestions

* some fixes

* fixed tpu tests

* Minor code cosmetrics and raised err tolerance in tests

* used list comprehension convolution, fixed tests

* added uniform kernel, change tolerance, various image size tests

* Update ignite/metrics/ssim.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/ssim.py

Co-authored-by: vfdev <[email protected]>

* Fix flake8

Co-authored-by: Sylvain Desroziers <[email protected]>
Co-authored-by: vfdev <[email protected]>

* add the EpochOutputStore with tests (pytorch#1226)

* add the EpochOutputStore with tests

* add correct import and unify the test cases

* fix checks from flake8 and isort

Co-authored-by: Zhiliang@siemens <[email protected]>

* add horovod test (pytorch#1230) (pytorch#1231)

Co-authored-by: Jeff Yang <[email protected]>

* Update README.md

* Added idist.broadcast (pytorch#1237)

* [WIP] Added idist.broadcast

* Removed unused code

* Added tests to increase coverage

* Docker for users pytorch#1214 (pytorch#1218)

* Docker for users pytorch#1214
- prebuilt docker image handling Ignite examples configuration

* Docker for users pytorch#1214
- more complete basic image based on pytorch 1.5.1-cuda10.1-cudnn7-devel
- with apex, opencv setups and pascal_voc2012 requirements
_ container running with non-privileged user

* Docker for users pytorch#1214
- improve Dockerfiles for vision and apex-vision (TORCH_CUDA_ARCH_LIST as argument)
- propose apex-vision with multi-stage build

* Docker for users pytorch#1214
- Dockerfiles for nlp and vision tasks with their apex version
- user as root, Ignite examples added

* Update README.md

Co-authored-by: Sylvain Desroziers <[email protected]>
Co-authored-by: vfdev <[email protected]>

* [BC-breaking] NotImplementedError -> NotImplemented (pytorch#1178)

* NotImplementedError -> NotImplemented

* returning NotImplemented, instead of raising it

* make type restriction inside  & add corresponding tests

* autopep8 fix

* remove extra spaces

* Updates according to the review

* Fixed unsupported f-string in 3.5
- added more tests

* Updated docs and tests

Co-authored-by: Sylvain Desroziers <[email protected]>
Co-authored-by: AutoPEP8 <>
Co-authored-by: vfdev-5 <[email protected]>

* Allow passing keyword arguments to save function on checkpoint. (pytorch#1245)

* Allow passing keyword arguments to save function on checkpoint.

* Change Docstring

* Add tests for keywords to DiskSaver

* autopep8 fix

* Use pytest.raises instead of xfail.

Co-authored-by: Sylvain Desroziers <[email protected]>
Co-authored-by: AutoPEP8 <>

* Docs updates and fix of black version (pytorch#1250)

* Update governance.rst

* Fix Exception misuse in `ignite.contrib.handlers.param_scheduler.py` (pytorch#1206)

* ValueError -> TypeError

* NotImplementedError -> NotImplemented

* rollback ignite/engine/events [raise NotImplementedError]

* fix misuses of exceptions in ignite/contrib/handlers/custom_events.py

* remove period in exceptions

* refactor corresponding unit tests

* fix misuses of exceptions in ignite/contrib/handlers/param_scheduler.py & refactor corresponding unit tests

* fix misuses of exceptions in ignite/contrib/handlers/param_scheduler.py & refactor corresponding unit tests (stricter: list/tuple -> TypeError & item of list/tuple -> ValueError)

* autopep8 fix

* remove extra spaces

* autopep8 fix

* add matches to pytest.raises

* add match to pytest.raises

* autopep8 fix

* add missing tests

* autopep8 fix

* Update param_scheduler.py

* revert previous modification

Co-authored-by: AutoPEP8 <>
Co-authored-by: Sylvain Desroziers <[email protected]>
Co-authored-by: vfdev <[email protected]>

* Issue pytorch#1247 (pytorch#1252)

* Delete test_custom_events.py

* Delete custom_events.py

* Removing depriciated CustomPeriodicEvent

* Remove deprecated CustomPeriodicEvent

* Update test_tqdm_logger.py

* Remove deprecated CustomPeriodicEvent

* Update test_tqdm_logger.py

Adding needed space.

* Removing CustomPeriodicEvent

* Update handlers.rst

* [WIP] Update readme for docker (pytorch#1254)

* [WIP] Update readme for docker

* Update README.md

Co-authored-by: vfdev <[email protected]>

* Update README.md

Co-authored-by: vfdev <[email protected]>

* [WIP] Update readme for docker
- fix rendering

* [WIP] Update readme for docker
- add DockerHub Ignite repo link and images list

* Updated readme

Co-authored-by: vfdev <[email protected]>

* Update README.md

* Update index.rst

* Update common.py

* Update CONTRIBUTING.md

* [WIP] Added sync_bn to auto_model with tests (pytorch#1265)

* Added dist support for EpochMetric and other similar metrics (pytorch#1229)

* [WIP] Added dist support for EpochMetric with tests

* Updated docs

* [WIP] Added idist.broadcast

* Removed unused code

* [WIP] Updated code

* Code and test updates

* autopep8 fix

* Replaced XLA unsupported type() method by attribute .dtype

* Updated code

Co-authored-by: AutoPEP8 <>

* Fixes pytorch#1258 (pytorch#1268)

- Replaced mp.spawn by mp.start_processes for native comp model

* Updated CONTRIBUTING.md (pytorch#1275)

* Updatd CONTRIBUTING.md

* Update CONTRIBUTING.md

* Rename Epoch to Iterations when using epoch_length with max_epochs=1 (pytorch#1279)

* Set default description as none

* Add test for description with max_epochs set to 1

* Change default description to use iterations when max_epochs=1

* Correct test_pbar_with_max_epochs_set_to_one

* Modify tests to reflect change from epochs to iterations

* Use engine.state.max_epochs instead of engine.state_dict()

* Change Iterations to Iteration

* Correct tests

* Update progress bar docstring

* Update tqdm_logger.py

Co-authored-by: vfdev <[email protected]>

* Update README.md

* [BC-breaking] Make Metrics accumulate values on device specified by user (pytorch#1232) (pytorch#1238)

* Make Metrics accumulate values on device specified by user (pytorch#1232)

* update accuracy to accumulate _num_correct in a tensor on the right device

* update loss metric to accumulate _sum in a tensor on the right device

* update mae metric to accumulate in a tensor on the right device

* update mpd metric to accumulate in a tensor on the right device

* update mse metric to accumulate in a tensor on the right device

* update top k accuracy  metric to accumulate in a tensor on the right device

* update precision and recall metrics to accumulate in tensors on the right device

* .....

* black formatting

* reverted run*.sh

* change all metrics default device to cpu except running_average

* Update ignite/metrics/precision.py

Co-authored-by: vfdev <[email protected]>

* remove Optional type from metric devices since default is cpu

* add comment explaining lack of detach in accuracy metrics

Co-authored-by: vfdev <[email protected]>

* Improved and fixed accuracy tests

* autopep8 fix

* update docs and docstrings for updated metrics (pytorch#1239)

* update accuracy to accumulate _num_correct in a tensor on the right device

* update loss metric to accumulate _sum in a tensor on the right device

* update mae metric to accumulate in a tensor on the right device

* update mpd metric to accumulate in a tensor on the right device

* update mse metric to accumulate in a tensor on the right device

* update top k accuracy  metric to accumulate in a tensor on the right device

* update precision and recall metrics to accumulate in tensors on the right device

* .....

* black formatting

* reverted run*.sh

* change all metrics default device to cpu except running_average

* Update ignite/metrics/precision.py

Co-authored-by: vfdev <[email protected]>

* remove Optional type from metric devices since default is cpu

* add comment explaining lack of detach in accuracy metrics

* update docstrings and docs

* Update ignite/metrics/accumulation.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/accumulation.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/accumulation.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/accuracy.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/fbeta.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/loss.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/metric.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/precision.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/recall.py

Co-authored-by: vfdev <[email protected]>

* add comment explaining lack of detach in metrics docs

* support device argument for running_average

* update support for device argumenet for accumulation

* fix and improve device tests for metrics

* fix and improve device tests for metrics

* fix TPU tests

* Apply suggestions from code review

* Apply suggestions from code review

Co-authored-by: vfdev <[email protected]>

* Updates to metrics_impl (pytorch#1266)

* update accuracy to accumulate _num_correct in a tensor on the right device

* update loss metric to accumulate _sum in a tensor on the right device

* update mae metric to accumulate in a tensor on the right device

* update mpd metric to accumulate in a tensor on the right device

* update mse metric to accumulate in a tensor on the right device

* update top k accuracy  metric to accumulate in a tensor on the right device

* update precision and recall metrics to accumulate in tensors on the right device

* .....

* black formatting

* reverted run*.sh

* change all metrics default device to cpu except running_average

* Update ignite/metrics/precision.py

Co-authored-by: vfdev <[email protected]>

* remove Optional type from metric devices since default is cpu

* add comment explaining lack of detach in accuracy metrics

* update docstrings and docs

* Update ignite/metrics/accumulation.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/accumulation.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/accumulation.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/accuracy.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/fbeta.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/loss.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/metric.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/precision.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/recall.py

Co-authored-by: vfdev <[email protected]>

* add comment explaining lack of detach in metrics docs

* support device argument for running_average

* update support for device argument for accumulation

* fix and improve device tests for metrics

* fix and improve device tests for metrics

* fix TPU tests

* Apply suggestions from code review

* Apply suggestions from code review

* detach tensors earlier in update

* remove redundant to() call

* ensure metrics aren't created on XLA devices

* Fixed isort

* move xla check to Metric.__init__ instead of individual metrics
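
A sketch of the kind of base-class check the item above refers to (the exact message and placement in ignite may differ):

```python
import torch


def _check_metric_device(device):
    # Sketch: per-metric accumulator tensors cannot live directly on XLA devices,
    # so the base class rejects them up front instead of each metric checking separately.
    if isinstance(device, str):
        device = torch.device(device)
    if device.type == "xla":
        raise ValueError("Cannot create metric on an XLA device. Use device='cpu' instead.")
    return device
```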

* update xla tests

* replace deleted callable check

* remove redundant precision and recall __init__

* replace precision/recall __init__ for docs rendering

* add support for metrics_lambda with components on diff devices

Co-authored-by: vfdev <[email protected]>
Co-authored-by: n2cholas <[email protected]>
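
A rough usage sketch of the last change listed above (`MetricsLambda` with components on different devices). The combining lambda is made up, and here it moves the recall result onto the precision result's device before combining; the component metrics deliberately accumulate on different devices.

```python
import torch
from ignite.metrics import MetricsLambda, Precision, Recall

# Component metrics placed on different devices.
precision = Precision(average=False, device="cuda" if torch.cuda.is_available() else "cpu")
recall = Recall(average=False, device="cpu")

# F1 combined from components that may accumulate on different devices;
# the user function reconciles the devices before doing arithmetic.
F1 = MetricsLambda(
    lambda p, r: (2 * p * r.to(p.device) / (p + r.to(p.device) + 1e-20)).mean().item(),
    precision,
    recall,
)
```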

* Update metrics.rst

* Update metrics.rst

* Fix TPU tests for metrics_impl branch (pytorch#1277)

* update accuracy to accumulate _num_correct in a tensor on the right device

* update loss metric to accumulate _sum in a tensor on the right device

* update mae metric to accumulate in a tensor on the right device

* update mpd metric to accumulate in a tensor on the right device

* update mse metric to accumulate in a tensor on the right device

* update top k accuracy metric to accumulate in a tensor on the right device

* update precision and recall metrics to accumulate in tensors on the right device

* .....

* black formatting

* reverted run*.sh

* change all metrics default device to cpu except running_average

* Update ignite/metrics/precision.py

Co-authored-by: vfdev <[email protected]>

* remove Optional type from metric devices since default is cpu

* add comment explaining lack of detach in accuracy metrics

* update docstrings and docs

* Update ignite/metrics/accumulation.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/accumulation.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/accumulation.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/accuracy.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/fbeta.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/loss.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/metric.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/precision.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/recall.py

Co-authored-by: vfdev <[email protected]>

* add comment explaining lack of detach in metrics docs

* support device argument for running_average

* update support for device argument for accumulation

* fix and improve device tests for metrics

* fix and improve device tests for metrics

* fix TPU tests

* Apply suggestions from code review

* Apply suggestions from code review

* detach tensors earlier in update

* remove redundant to() call

* ensure metrics aren't created on XLA devices

* Fixed isort

* move xla check to Metric.__init__ instead of individual metrics

* update xla tests

* replace deleted callable check

* remove redundant precision and recall __init__

* replace precision/recall __init__ for docs rendering

* add support for metrics_lambda with components on diff devices

* fix epoch_metric xla test

Co-authored-by: vfdev <[email protected]>
Co-authored-by: n2cholas <[email protected]>

* metrics_impl fix 2 gpu hvd tests and ensure consistent detaching (pytorch#1280)

* update accuracy to accumulate _num_correct in a tensor on the right device

* update loss metric to accumulate _sum in a tensor on the right device

* update mae metric to accumulate in a tensor on the right device

* update mpd metric to accumulate in a tensor on the right device

* update mse metric to accumulate in a tensor on the right device

* update top k accuracy metric to accumulate in a tensor on the right device

* update precision and recall metrics to accumulate in tensors on the right device

* .....

* black formatting

* reverted run*.sh

* change all metrics default device to cpu except running_average

* Update ignite/metrics/precision.py

Co-authored-by: vfdev <[email protected]>

* remove Optional type from metric devices since default is cpu

* add comment explaining lack of detach in accuracy metrics

* update docstrings and docs

* Update ignite/metrics/accumulation.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/accumulation.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/accumulation.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/accuracy.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/fbeta.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/loss.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/metric.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/precision.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/metrics/recall.py

Co-authored-by: vfdev <[email protected]>

* add comment explaining lack of detach in metrics docs

* support device argument for running_average

* update support for device argument for accumulation

* fix and improve device tests for metrics

* fix and improve device tests for metrics

* fix TPU tests

* Apply suggestions from code review

* Apply suggestions from code review

* detach tensors earlier in update

* remove redundant to() call

* ensure metrics aren't created on XLA devices

* Fixed isort

* move xla check to Metric.__init__ instead of individual metrics

* update xla tests

* replace deleted callable check

* remove redundant precision and recall __init__

* replace precision/recall __init__ for docs rendering

* add support for metrics_lambda with components on diff devices

* fix epoch_metric xla test

* detach output consistently for all metrics

* fix horovod two gpu tests

* make confusion matrix detach like other metrics

Co-authored-by: vfdev <[email protected]>
Co-authored-by: n2cholas <[email protected]>

* Fixes failing test on TPUs

Co-authored-by: Nicholas Vadivelu <[email protected]>
Co-authored-by: AutoPEP8 <>
Co-authored-by: Sylvain Desroziers <[email protected]>
Co-authored-by: n2cholas <[email protected]>

* Specify tqdm to be less than or equal to v4.48.0 (pytorch#1293)

* Fixes pytorch#1285 (pytorch#1290)

- use mp.spawn for pytorch < 1.5
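
A sketch of the version gate this fix refers to (the helper name is hypothetical; `torch.multiprocessing.start_processes` only exists in torch >= 1.5):

```python
from distutils.version import LooseVersion

import torch
import torch.multiprocessing as mp


def _start(worker_fn, nprocs):
    # worker_fn(local_rank) is called in each spawned process.
    # start_processes appeared in torch 1.5, so older versions fall back to mp.spawn.
    if LooseVersion(torch.__version__) < LooseVersion("1.5.0"):
        mp.spawn(worker_fn, nprocs=nprocs)
    else:
        mp.start_processes(worker_fn, nprocs=nprocs, start_method="spawn")
```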

* Issue 1249: fix ParamGroupScheduler with schedulers based on different optimizers (pytorch#1274)

* remove **kwargs from LRScheduler

* revert ParamGroupScheduler inheritance: remove ParamScheduler base class

* use ParamGroupScheduler in ConcatScheduler

* add tests for ParamGroupScheduler with multiple optimizers

* autopep8 fix

* fix doc example

* fix from vfdev comments

* refactor list of optimizers and param names

* add tests

* autopep8 fix

Co-authored-by: Desroziers <[email protected]>
Co-authored-by: AutoPEP8 <>
Co-authored-by: vfdev <[email protected]>
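
A usage sketch of what pytorch#1274 enables (the model/optimizer setup is hypothetical; it assumes the contrib `LinearCyclicalScheduler`/`ParamGroupScheduler` API):

```python
import torch
from ignite.contrib.handlers.param_scheduler import LinearCyclicalScheduler, ParamGroupScheduler

# Hypothetical setup: one scheduler per optimizer, grouped into a single handler.
model = torch.nn.Linear(10, 2)
opt_w = torch.optim.SGD([model.weight], lr=0.1)
opt_b = torch.optim.SGD([model.bias], lr=0.01)

s1 = LinearCyclicalScheduler(opt_w, "lr", start_value=0.1, end_value=0.01, cycle_size=20)
s2 = LinearCyclicalScheduler(opt_b, "lr", start_value=0.01, end_value=0.001, cycle_size=20)

# After pytorch#1274, the grouped scheduler no longer assumes a single optimizer.
scheduler = ParamGroupScheduler(schedulers=[s1, s2], names=["lr (weight)", "lr (bias)"])
```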

* remove prints (pytorch#1292)

* remove prints

* code formatting

Co-authored-by: vfdev <[email protected]>

* Fix link to pytorch documents (pytorch#1294)

* Fix link to pytorch documents

* Fix too long lines

Co-authored-by: vfdev <[email protected]>

* Added required_output_keys public attribute (1289) (pytorch#1291)

* Fixes pytorch#1289
- Promoted _required_output_keys to a public attribute so that users can override it.

* Updated docs
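
A sketch of the intended use, with a hypothetical subclass that adds an extra output key; it assumes `required_output_keys` is read when `engine.state.output` is a dict, as described above.

```python
from ignite.metrics import Accuracy


class WeightedAccuracy(Accuracy):
    # Hypothetical subclass: when engine.state.output is a dict, these keys are
    # extracted (in order) and passed to update(); users can now override them.
    required_output_keys = ("y_pred", "y", "sample_weight")

    def update(self, output):
        y_pred, y, sample_weight = output
        # A real implementation would weight the correct-prediction counts here;
        # this sketch simply ignores the extra key.
        super().update((y_pred, y))
```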

* Fixed typo in docs (concepts). (pytorch#1295)

* Setup Mypy check at CI step (pytorch#1296)

* add mypy file

* add mypy at CI step

* add mypy step at Contributing.md

Co-authored-by: vfdev <[email protected]>

* Update README.md

* Docker for users with Horovod (pytorch#1248)

* [WIP] Docker for users with Horovod
- base / vision / nlp
- with apex build

* [WIP] Docker for users with Horovod
- install horovod with .whl, add nccl in runtime image

* Docker for users with Horovod
- update Readmes for horovod images and configuration

* Docker for users with Horovod
- hvd tags/v0.20.0
- ignite examples with git sparse checkout

* Docker for users with Horovod
- update docs

Co-authored-by: Sylvain Desroziers <[email protected]>
Co-authored-by: vfdev <[email protected]>

* Added input data type check (pytorch#1301)

* Update metrics.rst

* Docker for users with MSDeepSpeed (pytorch#1304)

* Docker for users with DeepSpeed
- msdp-base | vision | nlp

* Docker for users with DeepSpeed
- rename image extensions to msdp-apex-*

Co-authored-by: Sylvain Desroziers <[email protected]>

* Update README.md

* Updated hvd images + scripts (pytorch#1306)

* Updated hvd images
- added scripts to auto build and push images

* Updated scripts according to the review

* Update BatchFiltered docstring

* Improve Canberra metric (pytorch#1312)

* Add abs on denominators in canberra metric and use sklearn in test

* autopep8 fix

* improve docstring

* use canberra on total computation

* Update canberra_metric.py

Co-authored-by: Desroziers <[email protected]>
Co-authored-by: AutoPEP8 <>
Co-authored-by: vfdev <[email protected]>
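
For reference, the corrected accumulation matches scikit-learn's canberra distance; a small numpy sketch with made-up data (assuming scikit-learn's `DistanceMetric`, as used in the new test):

```python
import numpy as np
from sklearn.neighbors import DistanceMetric

y = np.array([1.0, -2.0, 3.0])
y_pred = np.array([0.5, -1.0, 4.0])

# Canberra error with abs() on the denominators, as in this fix.
errors = np.abs(y - y_pred) / (np.abs(y) + np.abs(y_pred))

canberra = DistanceMetric.get_metric("canberra")
assert np.isclose(errors.sum(), canberra.pairwise([y, y_pred])[0, 1])
```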

* Improve Canberra metric for DDP (pytorch#1314)

* refactor canberra metric for ddp

* improve canberra for ddp

* autopep8 fix

* use tensor for accumulation

* detach output

* remove useless item()

* add missing move to device

* refactor detach() and move

* refactor to remove useless view_as and to()

* do not expose reinit__is_reduced and sync_all_reduce

Co-authored-by: Desroziers <[email protected]>
Co-authored-by: AutoPEP8 <>
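
The DDP-friendly pattern these commits converge on, sketched as a custom metric rather than the actual contrib implementation: tensor accumulators on `self._device`, detached updates, and an all-reduce before `compute()`.

```python
import torch
from ignite.metrics import Metric
from ignite.metrics.metric import reinit__is_reduced, sync_all_reduce


class CanberraLike(Metric):
    @reinit__is_reduced
    def reset(self):
        self._sum_of_errors = torch.tensor(0.0, device=self._device)

    @reinit__is_reduced
    def update(self, output):
        y_pred, y = output[0].detach(), output[1].detach()
        errors = torch.abs(y - y_pred) / (torch.abs(y) + torch.abs(y_pred))
        self._sum_of_errors += errors.sum().to(self._device)

    @sync_all_reduce("_sum_of_errors")
    def compute(self):
        # In distributed runs, _sum_of_errors is all-reduced across processes first.
        return self._sum_of_errors.item()
```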

* Improve ManhattanDistance metric for DDP (pytorch#1320)

* fix manhattan distance and improve for ddp

* replace article with sklearn documentation

* Update ignite/contrib/metrics/regression/manhattan_distance.py

Co-authored-by: vfdev <[email protected]>

Co-authored-by: Desroziers <[email protected]>
Co-authored-by: vfdev <[email protected]>

* Update README.md

* Update about.rst

* Update Circle CI docker image to pytorch 1.6.0 (pytorch#1325)

* Update Circle CI docker image to pytorch 1.6. Closes pytorch#1225

* Update Circle CI docker image to pytorch 1.6. Closes pytorch#1225 (pytorch#1322)

* Revert "Update Circle CI docker image to pytorch 1.6. Closes pytorch#1225 (pytorch#1322)" (pytorch#1323)

This reverts commit 22ecac6.

* Update Circle CI docker image to pytorch 1.6.0 Closes pytorch#1225

Co-authored-by: vfdev <[email protected]>

* Update CONTRIBUTING.md

* Add new logo (pytorch#1324)

* Update Circle CI docker image to pytorch 1.6. Closes pytorch#1225 (pytorch#1322)

* Revert "Update Circle CI docker image to pytorch 1.6. Closes pytorch#1225 (pytorch#1322)" (pytorch#1323)

This reverts commit 22ecac6.

* add logos

* remove past logo from readme

* add logo guidelines

* Update README.md

Changed size to 512

* Updated docs logo

Co-authored-by: Juan Miguel Boyero Corral <[email protected]>
Co-authored-by: vfdev <[email protected]>

* Fixed CI on GPUs with pth 1.6.0 (pytorch#1326)

* Fixed CI on GPUs with pth 1.6.0
- updated tests/run_gpu_tests.sh file
- updated nccl version to 2.7 for Horovod build

* Fixed hvd failing tests

* Updated about us (pytorch#1327)

- Added CITATION file

* Improve R2Score metric for DDP (pytorch#1318)

* improve r2 for ddp

* autopep8 fix

* _num_examples type is scalar

* autopep8 fix

Co-authored-by: Desroziers <[email protected]>
Co-authored-by: AutoPEP8 <>
Co-authored-by: vfdev <[email protected]>

* Fix canberra docstring: reference already in namespace (pytorch#1330)

Co-authored-by: Desroziers <[email protected]>
Co-authored-by: vfdev <[email protected]>

* Improve State and Engine docs pytorch#1259 (pytorch#1333)

- add State.restart() method
- add note in Engine.run() docstring / improve error message
- unit test for State.restart()

* Fix missing link in doc (pytorch#1336) (pytorch#1337)

* Make SSIM accumulate on specified device (pytorch#1328)

* make ssim accumulate on specified device

* keep output on original device until accumulation

* implement more efficient kernel creation

Co-authored-by: vfdev <[email protected]>
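
A usage sketch with random tensors (assuming the `SSIM` constructor takes `data_range` and `device`): the internal sums now stay on the device passed by the user.

```python
import torch
from ignite.metrics import SSIM

device = "cuda" if torch.cuda.is_available() else "cpu"
metric = SSIM(data_range=1.0, device=device)

y_pred = torch.rand(4, 3, 32, 32, device=device)
y = y_pred * 0.9
metric.update((y_pred, y))
print(metric.compute())  # accumulated on `device`, reported as a single value
```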

* Update documentation for terminate Events (pytorch#1338)

* Update documentation for terminate Events (pytorch#1332)

* Converted raw table in docstring to list table

* Update README.md

Co-authored-by: Anmol Joshi <[email protected]>
Co-authored-by: Sylvain Desroziers <[email protected]>
Co-authored-by: Marijan Smetko <[email protected]>
Co-authored-by: Desroziers <[email protected]>
Co-authored-by: Lavanya Shukla <[email protected]>
Co-authored-by: Akihiro Matsukawa <[email protected]>
Co-authored-by: Jake Henning <[email protected]>
Co-authored-by: Elijah Rippeth <[email protected]>
Co-authored-by: Wang Ran (汪然) <[email protected]>
Co-authored-by: Joseph Spisak <[email protected]>
Co-authored-by: Ryan Wong <[email protected]>
Co-authored-by: Joel Hanson <[email protected]>
Co-authored-by: Wansoo Kim <[email protected]>
Co-authored-by: Jeff Yang <[email protected]>
Co-authored-by: Zhiliang <[email protected]>
Co-authored-by: Zhiliang@siemens <[email protected]>
Co-authored-by: François COKELAER <[email protected]>
Co-authored-by: Kilian Pfeiffer <[email protected]>
Co-authored-by: Tawishi <[email protected]>
Co-authored-by: Michael Hollingworth <[email protected]>
Co-authored-by: Nicholas Vadivelu <[email protected]>
Co-authored-by: n2cholas <[email protected]>
Co-authored-by: Benjamin Lo <[email protected]>
Co-authored-by: Nidhi Zare <[email protected]>
Co-authored-by: Keisuke Kamahori <[email protected]>
Co-authored-by: Théo Dumont <[email protected]>
Co-authored-by: kenjihiraoka <[email protected]>
Co-authored-by: Juan Miguel Boyero Corral <[email protected]>
Co-authored-by: Isabela Presedo-Floyd <[email protected]>
Co-authored-by: Sumit Roy <[email protected]>
Co-authored-by: Shashank Gupta <[email protected]>