Problem resuming training with RectifiedAdam+Lookahead (Ranger) #1911
I have exactly the same problem as you. After calling `status = ckpt.restore(ckpt_manager.latest_checkpoint)` and `status.assert_consumed()`, it shows the following log.
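For reference, a minimal sketch of the restore-and-check pattern described above; the model, optimizer, and checkpoint directory below are placeholders, not the reporter's actual code:

```python
import tensorflow as tf
import tensorflow_addons as tfa

# Placeholders standing in for the objects built by the training script.
model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
optimizer = tfa.optimizers.Lookahead(tfa.optimizers.RectifiedAdam(learning_rate=1e-3))

ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
ckpt_manager = tf.train.CheckpointManager(ckpt, "./checkpoints", max_to_keep=3)

status = ckpt.restore(ckpt_manager.latest_checkpoint)
# assert_consumed() raises if any saved value (for example an optimizer slot
# such as Lookahead's "slow" weights) was not matched to an existing object.
status.assert_consumed()
```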
It seems that Lookahead does not maintain its "slow" slot well. @gtg740x Can you check the log of
/cc @CyberZHG
Can you check with #2126?
TensorFlow Addons is transitioning to a minimal maintenance and release mode. New features will not be added to this repository. For more information, please see our public messaging on this decision: Please consider sending feature requests / contributions to other repositories in the TF community with a similar charter to TFA:
System information
Describe the bug
If I train a model using the Ranger scheme (a RectifiedAdam optimizer wrapped in a Lookahead optimizer), I cannot interrupt and resume training as normal. With the exact same code but a standard Adam optimizer, training resumes as expected; with the Ranger scheme it does not.
When training resumes, the model restores to the accuracy it had when paused, but as soon as training steps resume the accuracy curve drops for many steps before slowly climbing back to where it was trending before the pause. The result is a much slower and choppier convergence than a run that is never paused.
If the Ranger setup is used for an uninterrupted run, training progresses smoothly and converges to the expected accuracy in the expected number of steps.
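For context, the Ranger setup being compared against plain Adam can be built roughly like this; the hyperparameter values below are illustrative, not the exact configuration used in the report:

```python
import tensorflow as tf
import tensorflow_addons as tfa

# Ranger = RectifiedAdam (RAdam) wrapped in Lookahead.
radam = tfa.optimizers.RectifiedAdam(
    learning_rate=1e-3,
    total_steps=10000,       # illustrative schedule length
    warmup_proportion=0.1,
    min_lr=1e-5,
)
ranger = tfa.optimizers.Lookahead(radam, sync_period=6, slow_step_size=0.5)

# Baseline optimizer that checkpoints and resumes cleanly in the same code path.
adam = tf.keras.optimizers.Adam(learning_rate=1e-3)
```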
Provide a reproducible test case that is the bare minimum necessary to generate the problem.
Run training and stop at a global_step > 1000 but < max_train_steps
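A minimal sketch of that first phase, assuming a custom training loop with a `tf.train.CheckpointManager`; `model`, `dataset`, `train_step`, and the step counts are placeholders:

```python
import tensorflow as tf

# Checkpoint the model, the Ranger optimizer, and a global step counter.
step = tf.Variable(0, dtype=tf.int64)
ckpt = tf.train.Checkpoint(model=model, optimizer=ranger, step=step)
ckpt_manager = tf.train.CheckpointManager(ckpt, "./checkpoints", max_to_keep=3)

max_train_steps = 10000
stop_at = 2000  # any global_step > 1000 but < max_train_steps

for batch in dataset:
    train_step(batch)           # forward/backward pass using the Ranger optimizer
    step.assign_add(1)
    if int(step) % 500 == 0:
        ckpt_manager.save(checkpoint_number=int(step))
    if int(step) >= stop_at:
        break
```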
Then when I try to resume training from the saved model and re-enter the training loop above, with total_steps starting at the restored step count, the model restores to its previous accuracy. However, as soon as training steps resume there is an immediate dip in accuracy, as if the optimizer has to "warm up" again, possibly due to the Lookahead slow weights.
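A corresponding sketch of the resume path, reusing the `ckpt`, `ckpt_manager`, `step`, and loop placeholders from the snippet above; again, names and step counts are illustrative:

```python
# Rebuild the model and Ranger optimizer exactly as before, then restore.
status = ckpt.restore(ckpt_manager.latest_checkpoint)

# Continue the same loop from the restored step count up to max_train_steps.
for batch in dataset:
    train_step(batch)           # accuracy dips here for many steps under Ranger
    step.assign_add(1)
    if int(step) % 500 == 0:
        ckpt_manager.save(checkpoint_number=int(step))
    if int(step) >= max_train_steps:
        break
```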
Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.