OOM during training on v3-8 #1881
There are 16GB per core, not 128GB of unified memory 😉 But I can see we currently do not support uniform_().
Thanks for clarifying. I will try again when you add uniform_() support.
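(Editor's note, not part of the thread: a minimal sketch of one possible interim workaround, assuming the missing lowering is the in-place uniform_() initializer. The idea is to run the random initialization on a CPU tensor and only then move the result to the XLA device, so the unsupported op never executes on the TPU. Shapes and bounds below are placeholders.)

```python
# Sketch of a workaround for a missing uniform_() lowering on XLA:
# initialize on CPU, then transfer to the TPU device.
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()

# uniform_() runs on the CPU tensor; .to(device) copies the result to the TPU.
weight = torch.empty(1024, 1024).uniform_(-0.1, 0.1).to(device)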
The export XLA_USE_BF16=1 setting should help reduce memory usage.
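(Editor's note, not from the thread: a minimal sketch of how the flag can be applied from inside a script, assuming PyTorch/XLA reads the environment variable at import time. Under this flag, float32 tensors are stored as bfloat16 on the TPU, which roughly halves their memory footprint, while PyTorch still reports the dtype as float32.)

```python
# Sketch: enable bf16 storage on the TPU by setting the flag
# before torch_xla is imported.
import os
os.environ["XLA_USE_BF16"] = "1"

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
x = torch.randn(8, 128).to(device)
print(x.dtype)  # reported as torch.float32, but stored as bf16 on the device
```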
Thanks. I set XLA_USE_BF16 and the training is going fine, but the optimizer step is very slow, much slower than on the GPU.
OK, on the second try the optimizer takes a step at normal speed. It is a little bit strange.
This is normal.
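(Editor's note, not from the thread: the slowdown on the first step is typically XLA tracing and compilation; later steps reuse the cached compiled graph. A small sketch of how this can be confirmed with the debug metrics report, assuming a torch_xla build that ships the torch_xla.debug.metrics module.)

```python
# Sketch: inspect XLA metrics after a few training steps.
# A CompileTime counter that grows only on the first step(s),
# followed by cached executions, points to graph compilation
# as the source of the one-off slow step.
import torch_xla.debug.metrics as met

# ... run a few training steps first ...
print(met.metrics_report())
```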
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi, I am trying to train a new sinkhorn-transformer model on a TPU, and training fails with OOM at the optimizer step.
With the reformer model, training does not fail with OOM.
To reproduce, clone and install the package from here.
Script to reproduce the issue:
Stack trace:
Why does this happen? A v3-8 TPU has 128GB of memory, and this model is exactly the same size as the reformer model. On a GPU this model allocates 40GB with the optimizer and loss (with batch size 1).
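(Editor's note, not from the report, for context on the memory math: a v3-8 exposes 8 cores with roughly 16GB of HBM each, and in the usual data-parallel setup each core holds a full replica of the model, optimizer state, and activations. A ~40GB per-replica footprint observed on a GPU therefore cannot fit in a single core. Below is a minimal sketch of such a per-core training entrypoint; the model, shapes, and step count are placeholders, not the linked package or script.)

```python
# Sketch: one process per TPU core on a v3-8; each core runs its own
# replica, so the whole replica must fit in that core's ~16GB of HBM.
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index):
    device = xm.xla_device()
    model = torch.nn.Linear(1024, 1024).to(device)  # placeholder model
    optimizer = torch.optim.Adam(model.parameters())

    for step in range(10):
        optimizer.zero_grad()
        x = torch.randn(1, 1024).to(device)  # batch size 1, as in the report
        loss = model(x).sum()                # placeholder loss
        loss.backward()
        # Reduce gradients across cores, apply the update, and sync the graph.
        xm.optimizer_step(optimizer, barrier=True)


if __name__ == "__main__":
    xmp.spawn(_mp_fn, nprocs=8)
```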