
Memory allocated on gpu:0 when using torch.cuda.empty_cache() #3016

Closed
kmistry-wx opened this issue Aug 17, 2020 · 9 comments
Labels
bug (Something isn't working), help wanted (Open to be worked on)
Milestone
0.9.x
Comments

@kmistry-wx commented Aug 17, 2020

🐛 Bug

PyTorch Lightning calls torch.cuda.empty_cache() at certain points, e.g. at the end of the training loop. When the trainer is set to run on GPUs other than gpu:0, running torch.cuda.empty_cache() still allocates memory on gpu:0. Apparently this is the default device's CUDA context being initialized, but it can be avoided. For example:

with torch.cuda.device('cuda:1'):
    torch.cuda.empty_cache()

If the cache is emptied this way, no memory is allocated on any GPU other than the one specified.
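
For illustration, a minimal sketch of that workaround wrapped in a helper (empty_cache_on is a name I made up for this sketch, not a Lightning or PyTorch API):

import torch

def empty_cache_on(device_index: int) -> None:
    # Switch the current CUDA device for the duration of the call, so that
    # empty_cache() does not create a context (and allocate memory) on cuda:0.
    if torch.cuda.is_available():
        with torch.cuda.device(device_index):
            torch.cuda.empty_cache()

empty_cache_on(1)  # only touches cuda:1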

This seems to be the same problem as #458, which was never resolved and is still present.

To Reproduce

Steps to reproduce the behavior:

  1. Create a pl.Trainer with gpus=[1]
  2. Fit a model on gpu:1
  3. torch.cuda.empty_cache() runs in run_training_teardown at the end of the training loop
  4. nvidia-smi shows memory usage on gpu:0
  5. If gpu:0 already has high memory usage from another job, this throws a CUDA out-of-memory error:
.../pytorch_lightning/trainer/training_loop.py in run_training_teardown(self)
   1153             model = self.get_model()
   1154             model.cpu()
-> 1155             torch.cuda.empty_cache()
   1156 
   1157     def training_forward(self, batch, batch_idx, opt_idx, hiddens):

.../torch/cuda/memory.py in empty_cache()
     84     """
     85     if is_initialized():
---> 86         torch._C._cuda_emptyCache()
     87 
     88 

RuntimeError: CUDA error: out of memory

Code sample

trainer = Trainer(gpus=[1])
trainer.fit(task, train_dataloader, val_dataloader)
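
For completeness, a hedged sketch of how one might observe the unexpected allocation; task, train_dataloader and val_dataloader are placeholders from the snippet above (not defined here), and querying nvidia-smi is just one way to read per-GPU usage without touching CUDA from Python:

import subprocess
import pytorch_lightning as pl

def gpu_memory_used_mib():
    # Ask nvidia-smi for per-GPU memory use so the check itself does not
    # initialize a CUDA context in this process.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]
    )
    return [int(line) for line in out.decode().split()]

print("before:", gpu_memory_used_mib())
trainer = pl.Trainer(gpus=[1])
trainer.fit(task, train_dataloader, val_dataloader)  # placeholders from the report
print("after:", gpu_memory_used_mib())  # on affected versions, gpu:0 usage also grows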

Expected behavior

Only gpu:1 should be used when training this model.

Environment

  • CUDA:
    • GPU:
      • GeForce RTX 2080 Ti
      • GeForce GTX 1080 Ti
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.18.1
    • pyTorch_debug: False
    • pyTorch_version: 1.5.0
    • pytorch-lightning: 0.9.0rc12
    • tensorboard: 2.2.1
    • tqdm: 4.46.0
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor: x86_64
    • python: 3.8.3
    • version: 18.04.1-Ubuntu
kmistry-wx added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Aug 17, 2020
@github-actions (Contributor)

Hi! Thanks for your contribution, great first issue!

@nateraw (Contributor) commented Aug 17, 2020

Mind submitting a PR for this?

@justusschock (Member)

I remember there was a bug in PyTorch that always created a CUDA context (and thus allocated memory) on the default GPU, which is the first one. Not sure whether that is also the issue here, though.
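
As an aside (my assumption, not something verified in this thread), hiding the other GPUs from the process before CUDA is initialized rules that default-device context out entirely:

import os
# Must be set before anything in this process initializes the CUDA driver;
# after this, the only visible GPU is re-indexed as cuda:0 inside the process.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch
print(torch.cuda.device_count())  # 1 -- no context can ever be created on physical gpu:0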

@williamFalcon (Contributor)

Yeah, I think this is a PyTorch bug... will investigate after 0.9.

williamFalcon added this to the 1.0.0 milestone on Aug 19, 2020
edenlightning modified the milestones: 1.0.0 → 0.9.x on Sep 1, 2020
@edenlightning (Contributor)

@kmistry-wx does the issue persist?

@kmistry-wx (Author)

@edenlightning After upgrading (PyTorch 1.5.0 -> 1.6.0, PyTorch Lightning 0.9.0rc12 -> 0.9.0), I can no longer reproduce this. A trainer run on gpu:1 now seems to use resources only from gpu:1.

@D-X-Y commented Mar 30, 2021

I'm using 1.8.0 on Linux with CUDA 11.2 and I'm also seeing this problem. Everything in my model is on GPU:1, but when I call torch.cuda.empty_cache(), it increases memory usage on GPU:0.

@xingqian2018

The same issue has been found on 1.6.0.

@YunruiZhang

Same issue found on 1.13.1
