
global_step advanced between accumulations if gradient_accumulation > 1 #831

Closed
peteriz opened this issue Feb 13, 2020 · 1 comment · Fixed by #832
Labels
bug Something isn't working

Comments

peteriz commented Feb 13, 2020

🐛 Bug

If gradient_accumulation is > 1 and a custom scheduler is used that updates the LR based on steps (instead of the default epochs), then global_step is incorrect: it advances on every part of the accumulated batch (depending on the gradient_accumulation value) instead of only once after all gradient_accumulation parts of the batch have been processed.

To fix this:
The Trainer's global_step should advance only if global_step % gradient_accumulation == 0.
This has no effect when gradient_accumulation == 1 (global_step keeps advancing as currently implemented).
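
Why this matters in practice: a step-based scheduler computes the LR from the step count, so over-counting the steps distorts the schedule. A minimal plain-PyTorch sketch (the warmup LambdaLR below is purely illustrative, not the actual scheduler used):

```python
import torch

# Purely illustrative: a step-based scheduler whose LR depends on the step
# count. With gradient_accumulation=2 and the current over-counting, the
# warmup would finish in half the intended number of optimizer steps.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
warmup_steps = 100
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)
# scheduler.step() should be called once per optimizer step (once per
# accumulated batch), not once per micro-batch.
```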

To Reproduce

Steps to reproduce the behavior:

Run any model with gradient_accumulation > 1 and verify with trainer.global_step
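
A minimal reproduction sketch (TinyModel is just an example module; the hook and argument names follow recent Lightning releases and may differ slightly in older versions):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self(x), y)
        return {"loss": loss}

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def train_dataloader(self):
        ds = TensorDataset(torch.randn(8, 4), torch.randn(8, 1))
        return DataLoader(ds, batch_size=1)


trainer = pl.Trainer(max_epochs=1, accumulate_grad_batches=2)
trainer.fit(TinyModel())
# With 8 batches of size 1 and accumulation of 2 we would expect 4 optimizer
# steps, but the reported bug makes global_step count every batch instead.
print(trainer.global_step)
```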

Expected behavior

For example (batch size = 1, accumulation = 2), the current (incorrect) flow is:

b_idx = 0, global_step = 1
b_idx = 1, global_step = 2
backprop
b_idx = 2, global_step = 3
b_idx = 3, global_step = 4
backprop

The correct flow would be:

b_idx = 0, global_step = 1
b_idx = 1, global_step = 1
backprop
b_idx = 2, global_step = 2
b_idx = 3, global_step = 2
backprop
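
A plain-PyTorch sketch of the intended counting (not Lightning's internals), printing the counter after each micro-batch to mirror the flow above:

```python
import torch

# global_step advances once per optimizer step, i.e. once per
# accumulate_grad_batches micro-batches.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accumulate_grad_batches = 2
global_step = 0

for batch_idx in range(4):  # batch size 1, four micro-batches
    x, y = torch.randn(1, 4), torch.randn(1, 1)
    loss = torch.nn.functional.mse_loss(model(x), y) / accumulate_grad_batches
    loss.backward()
    if (batch_idx + 1) % accumulate_grad_batches == 0:
        optimizer.step()      # the "backprop" step in the flow above
        optimizer.zero_grad()
        global_step += 1      # advance only when the optimizer actually steps
    print(f"b_idx = {batch_idx}, global_step = {global_step}")
```

(Whether the counter reads 0 or 1 before the first optimizer step is only an off-by-one convention; the point is that it advances once per accumulate_grad_batches micro-batches.)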

Environment

Collecting environment information...
PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: TITAN Xp
GPU 1: TITAN Xp
GPU 2: TITAN Xp
GPU 3: TITAN Xp

Nvidia driver version: 418.87.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.4.2
/usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudnn.so.7.4.2
/usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudnn.so.6.0.20

Versions of relevant libraries:
[pip] numpy==1.18.1
[pip] pytorch-ignite==0.2.1
[pip] pytorch-lightning==0.6.0
[pip] torch==1.4.0
[pip] torchprof==1.0.0
[pip] torchvision==0.5.0
[conda] blas 1.0 mkl
[conda] mkl 2019.4 243
[conda] mkl-service 2.3.0 py37he904b0f_0
[conda] mkl_fft 1.0.15 py37ha843d7b_0
[conda] mkl_random 1.1.0 py37hd6b4f25_0
[conda] pytorch 1.4.0 py3.7_cuda10.1.243_cudnn7.6.3_0 pytorch
[conda] pytorch-ignite 0.2.1 pypi_0 pypi
[conda] pytorch-lightning 0.6.0 dev_0
[conda] torchprof 1.0.0 pypi_0 pypi
[conda] torchvision 0.5.0 py37_cu101 pytorch

peteriz added the bug (Something isn't working) label on Feb 13, 2020
Borda (Member) commented Feb 13, 2020

Hi, thanks for bringing this up... I think there is a misunderstanding about the difference between a step and an epoch. To clarify:

  • global_step is the step index since the beginning of training
  • total_batch_idx is the batch index since the beginning of training
  • batch_idx is the batch index within a particular epoch

@williamFalcon can you confirm this? Then we should add it to the code/docs...
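
For illustration, assuming the accumulation-aware counting proposed above (this is not the Trainer's actual implementation), the three counters would relate like this over two epochs:

```python
# Illustration of the three counters (assumes the accumulation-aware fix;
# not Lightning's code):
accumulate_grad_batches = 2
batches_per_epoch = 4
global_step = 0        # optimizer-step index since the beginning of training
total_batch_idx = 0    # batch index since the beginning of training

for epoch in range(2):
    for batch_idx in range(batches_per_epoch):   # batch index within the epoch
        if (batch_idx + 1) % accumulate_grad_batches == 0:
            global_step += 1
        print(epoch, batch_idx, total_batch_idx, global_step)
        total_batch_idx += 1                      # never resets across epochs
```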
