Memory allocated on gpu:0 when using torch.cuda.empty_cache() #3016
Comments
Hi! Thanks for your contribution, great first issue!
Mind submitting a PR for this?
I remember that there was a bug in PyTorch which always created a CUDA context (and thus allocated memory) on the default GPU (the first one). Not sure if this is also the issue here, though.
Yeah, I think this is a PyTorch bug... will investigate after 0.9.
@kmistry-wx does the issue persist?
@edenlightning After upgrading (PyTorch 1.5.0 -> 1.6.0, PyTorch Lightning 0.9.0rc12 -> 0.9.0), I can no longer reproduce. A trainer run on gpu:1 seems to only use resources from gpu:1 now.
I'm using 1.8.0 on Linux with CUDA 11.2 and ran into the same problem. Everything in my model is on GPU:1, but when I call torch.cuda.empty_cache(), it increases memory usage on GPU:0.
The same issue occurs on 1.6.0.
The same issue occurs on 1.13.1.
🐛 Bug
PyTorch Lightning calls torch.cuda.empty_cache() at times, e.g. at the end of the training loop. When the trainer is set to run on GPUs other than gpu:0, it still allocates memory on gpu:0 when running torch.cuda.empty_cache(). Apparently this comes from the initial device context, but it can be avoided. For example:
If the cache is emptied in this way, it will not allocate memory on any GPU other than the one specified.
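A sketch of the device-scoped workaround; the gpu:1 index is an assumption matching the reproduction setup below:

```python
import torch

# Scope the cache release to the GPU the trainer actually uses,
# so no CUDA context (and hence no memory) is created on gpu:0.
with torch.cuda.device("cuda:1"):
    torch.cuda.empty_cache()
```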
This seems to be the same issue as in #458, but was never resolved and is still an issue.
To Reproduce
Steps to reproduce the behavior:
Train with the trainer pinned to gpu:1 (see the code sample below); nvidia-smi then shows memory usage on gpu:0.
Code sample
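The original code sample was not captured; below is a minimal reproduction sketch. The model, data, and hyperparameters are illustrative assumptions, and `gpus=[1]` matches the older Trainer API discussed in this thread:

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset


class BoringModel(pl.LightningModule):
    """Tiny model used only to trigger a training run on gpu:1."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    data = TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,)))
    loader = DataLoader(data, batch_size=8)

    # Pin training to gpu:1 only; after fit() finishes, `nvidia-smi`
    # nevertheless shows a small allocation on gpu:0.
    trainer = pl.Trainer(gpus=[1], max_epochs=1)
    trainer.fit(BoringModel(), loader)
```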
Expected behavior
Only gpu:1 should be used when training this model.
Environment