validation_epoch_end behavior with DDP #1479
Comments
I asked around and apparently that is the intended behaviour right now, i.e. validation_epoch_end is per-process, and we cannot access global information for metrics or logging. I was able to solve this by doing the all_reduce myself, with something like this:
Not sure why I had to explicitly move the tensors onto their processes' devices (their devices were all reported as -1).
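The snippet itself did not survive the page export. A minimal sketch of the manual all_reduce approach described above, assuming the process group has been initialized (as DDP does); `reduce_metric` is a hypothetical helper name, and the demo at the bottom stands in for a real multi-process launch:

```python
import os

import torch
import torch.distributed as dist


def reduce_metric(value: torch.Tensor, device: torch.device) -> torch.Tensor:
    """Sum a scalar metric across all DDP processes, then average it."""
    # Per-process outputs can come back detached with device reported as -1,
    # so move the tensor onto this process's device before the collective.
    value = value.to(device)
    dist.all_reduce(value, op=dist.ReduceOp.SUM)
    return value / dist.get_world_size()


# Single-process demo (world_size=1) so the sketch runs without a DDP launcher;
# under real DDP each rank would call reduce_metric with its own device.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)
mean_loss = reduce_metric(torch.tensor(0.5), torch.device("cpu"))
dist.destroy_process_group()
```

With world_size=1 the reduction is a no-op, but under 8 ranks the same call would yield the mean of the eight per-GPU values.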
Also want native support of this ability, such as adding an argument to
I am using the latest version.
I might be misunderstanding how PL works, but when using DDP the outputs my validation_epoch_end receives still come from a single GPU, and I thought they would be collated from all GPUs.
E.g. My validation dataset has 888 images, but when I validate on 8 GPUs (batch size of 1), I only get 111 batches in validation_epoch_end.
If that's correct, how can I produce metrics that combine information from all GPUs?
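One way to combine information from all GPUs is to all_gather each rank's outputs before computing the metric, so every process sees the full validation set; a minimal sketch, assuming equal-sized per-rank tensors and an initialized process group (`gather_outputs` is a hypothetical helper name, and the single-process demo stands in for a real 8-GPU launch):

```python
import os

import torch
import torch.distributed as dist


def gather_outputs(local_preds: torch.Tensor) -> torch.Tensor:
    """Collect every rank's predictions so each process sees the full set."""
    world_size = dist.get_world_size()
    buffers = [torch.empty_like(local_preds) for _ in range(world_size)]
    dist.all_gather(buffers, local_preds)
    # e.g. 8 ranks x 111 predictions -> 888 predictions on every rank
    return torch.cat(buffers)


# Single-process demo (world_size=1) so the sketch runs without a DDP launcher.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)
all_preds = gather_outputs(torch.tensor([0.25, 0.75]))
dist.destroy_process_group()
```

Note that dist.all_gather requires tensors of the same shape on every rank, which holds here since each of the 8 GPUs sees exactly 111 batches of size 1.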