🐛 Bug
If gradient_accumulation is > 1 and a custom scheduler is used that updates the LR based on steps (instead of the default epochs), then global_step is incorrect: it advances on every batch part instead of only once after all gradient_accumulation parts of the batch have been processed.
To fix this:
Advance the Trainer's global_step only if global_step % gradient_accumulation == 0.
This has no effect if gradient_accumulation == 1 (global_step advances exactly as currently implemented). A sketch of the intended behavior follows below.
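For illustration, here is a minimal, self-contained sketch of the intended behavior as a plain PyTorch loop. It is not Lightning's actual internals: names such as accumulate_grad_batches are illustrative, and the divisibility check is done on the batch index.

```python
import torch
from torch import nn

# Minimal sketch of the intended behavior, assuming a plain PyTorch loop:
# a tiny model, 8 dummy batches of size 1, and accumulation over 2 batches.
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accumulate_grad_batches = 2
global_step = 0

batches = [torch.randn(1, 4) for _ in range(8)]
for batch_idx, batch in enumerate(batches):
    loss = model(batch).sum() / accumulate_grad_batches  # scale for accumulation
    loss.backward()  # gradients accumulate across batch parts
    # Step the optimizer -- and advance global_step -- only once per
    # accumulated batch, not on every batch part.
    if (batch_idx + 1) % accumulate_grad_batches == 0:
        optimizer.step()
        optimizer.zero_grad()
        global_step += 1
    print(f"batch {batch_idx}: global_step = {global_step}")
```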
To Reproduce
Steps to reproduce the behavior:
Run any model with gradient_accumulation > 1 and inspect trainer.global_step.
Expected behavior
For example, with batch size = 1 and accumulation = 2, the correct flow is: batch parts 1–2 → one optimizer step → global_step = 1; batch parts 3–4 → one optimizer step → global_step = 2; and so on. With the current behavior, global_step instead already reaches 2 after batch part 2. The sketch below shows the effect this has on a step-based scheduler.
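To make the scheduler consequence concrete, here is a hedged sketch in plain PyTorch with an invented schedule (halve the LR every 2 steps). With accumulation = 2 over 8 batch parts there are only 4 real optimizer steps, so the LR should end at 0.25; stepping the scheduler once per batch part decays it twice as fast.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import LambdaLR

# Invented schedule for this demo: halve the LR every 2 scheduler steps.
def make_opt_sched():
    opt = torch.optim.SGD(nn.Linear(4, 1).parameters(), lr=1.0)
    sched = LambdaLR(opt, lr_lambda=lambda step: 0.5 ** (step // 2))
    return opt, sched

# Buggy flow: global_step (and hence the scheduler) advances on every
# batch part, so the scheduler sees 8 steps instead of 4.
opt, sched = make_opt_sched()
for _ in range(8):
    sched.step()
print(opt.param_groups[0]["lr"])  # 0.0625 -- LR decayed twice too fast

# Correct flow: advance only once per accumulated batch (4 optimizer steps).
opt, sched = make_opt_sched()
for _ in range(4):
    sched.step()
print(opt.param_groups[0]["lr"])  # 0.25 -- the intended schedule
```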
Environment
Collecting environment information...
PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1
OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
CMake version: version 3.5.1
Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: TITAN Xp
GPU 1: TITAN Xp
GPU 2: TITAN Xp
GPU 3: TITAN Xp
Nvidia driver version: 418.87.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.4.2
/usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudnn.so.7.4.2
/usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudnn.so.6.0.20
Versions of relevant libraries:
[pip] numpy==1.18.1
[pip] pytorch-ignite==0.2.1
[pip] pytorch-lightning==0.6.0
[pip] torch==1.4.0
[pip] torchprof==1.0.0
[pip] torchvision==0.5.0
[conda] blas 1.0 mkl
[conda] mkl 2019.4 243
[conda] mkl-service 2.3.0 py37he904b0f_0
[conda] mkl_fft 1.0.15 py37ha843d7b_0
[conda] mkl_random 1.1.0 py37hd6b4f25_0
[conda] pytorch 1.4.0 py3.7_cuda10.1.243_cudnn7.6.3_0 pytorch
[conda] pytorch-ignite 0.2.1 pypi_0 pypi
[conda] pytorch-lightning 0.6.0 dev_0
[conda] torchprof 1.0.0 pypi_0 pypi
[conda] torchvision 0.5.0 py37_cu101 pytorch