
global_step advanced between accumulations if gradient_accumulation > 1 #831

Closed
peteriz opened this issue Feb 13, 2020 · 1 comment · Fixed by #832
Labels
bug Something isn't working

Comments

peteriz commented Feb 13, 2020

🐛 Bug

If gradient_accumulation is > 1 and a custom scheduler is used that updates the LR based on steps (instead of the default epochs), then global_step is incorrect: it advances on every part of the accumulated batch (depending on the gradient_accumulation value) instead of only once after all gradient_accumulation parts of the batch have been processed.

To fix this:
The Trainer's global_step should advance only if global_step % gradient_accumulation == 0.
This has no effect when gradient_accumulation == 1 (global_step keeps advancing as currently implemented).
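
Why this matters in practice: a step-based scheduler computes the LR from the step count, so over-counting the steps distorts the schedule. A minimal plain-PyTorch sketch (the warmup LambdaLR below is purely illustrative, not the actual scheduler used):

```python
import torch

# Purely illustrative: a step-based scheduler whose LR depends on the step
# count. With gradient_accumulation=2 and the current over-counting, the
# warmup would finish in half the intended number of optimizer steps.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
warmup_steps = 100
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)
# scheduler.step() should be called once per optimizer step (once per
# accumulated batch), not once per micro-batch.
```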

To Reproduce

Steps to reproduce the behavior:

Run any model with gradient_accumulation > 1 and verify with trainer.global_step
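
A minimal reproduction sketch (TinyModel is just an example module; the hook and argument names follow recent Lightning releases and may differ slightly in older versions):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self(x), y)
        return {"loss": loss}

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def train_dataloader(self):
        ds = TensorDataset(torch.randn(8, 4), torch.randn(8, 1))
        return DataLoader(ds, batch_size=1)


trainer = pl.Trainer(max_epochs=1, accumulate_grad_batches=2)
trainer.fit(TinyModel())
# With 8 batches of size 1 and accumulation of 2 we would expect 4 optimizer
# steps, but the reported bug makes global_step count every batch instead.
print(trainer.global_step)
```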

Expected behavior

For example (batch size = 1, accumulation = 2), the current (incorrect) flow is:

b_idx = 0, global_step = 1
b_idx = 1, global_step = 2
backprop
b_idx = 2, global_step = 3
b_idx = 3, global_step = 4
backprop

The correct flow would be:

b_idx = 0, global_step = 1
b_idx = 1, global_step = 1
backprop
b_idx = 2, global_step = 2
b_idx = 3, global_step = 2
backprop
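
A plain-PyTorch sketch of the intended counting (not Lightning's internals), printing the counter after each micro-batch to mirror the flow above:

```python
import torch

# global_step advances once per optimizer step, i.e. once per
# accumulate_grad_batches micro-batches.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accumulate_grad_batches = 2
global_step = 0

for batch_idx in range(4):  # batch size 1, four micro-batches
    x, y = torch.randn(1, 4), torch.randn(1, 1)
    loss = torch.nn.functional.mse_loss(model(x), y) / accumulate_grad_batches
    loss.backward()
    if (batch_idx + 1) % accumulate_grad_batches == 0:
        optimizer.step()      # the "backprop" step in the flow above
        optimizer.zero_grad()
        global_step += 1      # advance only when the optimizer actually steps
    print(f"b_idx = {batch_idx}, global_step = {global_step}")
```

(Whether the counter reads 0 or 1 before the first optimizer step is only an off-by-one convention; the point is that it advances once per accumulate_grad_batches micro-batches.)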

Environment

Collecting environment information...
PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: TITAN Xp
GPU 1: TITAN Xp
GPU 2: TITAN Xp
GPU 3: TITAN Xp

Nvidia driver version: 418.87.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.4.2
/usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudnn.so.7.4.2
/usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudnn.so.6.0.20

Versions of relevant libraries:
[pip] numpy==1.18.1
[pip] pytorch-ignite==0.2.1
[pip] pytorch-lightning==0.6.0
[pip] torch==1.4.0
[pip] torchprof==1.0.0
[pip] torchvision==0.5.0
[conda] blas 1.0 mkl
[conda] mkl 2019.4 243
[conda] mkl-service 2.3.0 py37he904b0f_0
[conda] mkl_fft 1.0.15 py37ha843d7b_0
[conda] mkl_random 1.1.0 py37hd6b4f25_0
[conda] pytorch 1.4.0 py3.7_cuda10.1.243_cudnn7.6.3_0 pytorch
[conda] pytorch-ignite 0.2.1 pypi_0 pypi
[conda] pytorch-lightning 0.6.0 dev_0
[conda] torchprof 1.0.0 pypi_0 pypi
[conda] torchvision 0.5.0 py37_cu101 pytorch

peteriz added the bug (Something isn't working) label on Feb 13, 2020
Borda (Member) commented Feb 13, 2020

Hi, thanks for bringing this up... I think there is a misunderstanding about the difference between a step and an epoch. To clarify:

  • global_step is the step index since the beginning of training
  • total_batch_idx is the batch index since the beginning of training
  • batch_idx is the batch index within a particular epoch

@williamFalcon can you confirm this? Then we should add it to the code/docs...
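
For illustration, assuming the accumulation-aware counting proposed above (this is not the Trainer's actual implementation), the three counters would relate like this over two epochs:

```python
# Illustration of the three counters (assumes the accumulation-aware fix;
# not Lightning's code):
accumulate_grad_batches = 2
batches_per_epoch = 4
global_step = 0        # optimizer-step index since the beginning of training
total_batch_idx = 0    # batch index since the beginning of training

for epoch in range(2):
    for batch_idx in range(batches_per_epoch):   # batch index within the epoch
        if (batch_idx + 1) % accumulate_grad_batches == 0:
            global_step += 1
        print(epoch, batch_idx, total_batch_idx, global_step)
        total_batch_idx += 1                      # never resets across epochs
```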
