
Memory allocated on gpu:0 when using torch.cuda.empty_cache() #3016

Closed
kmistry-wx opened this issue Aug 17, 2020 · 9 comments
Labels
bug (Something isn't working), help wanted (Open to be worked on)
Milestone
0.9.x
Comments

@kmistry-wx commented Aug 17, 2020

🐛 Bug

PyTorch Lightning calls torch.cuda.empty_cache() at certain points, e.g. at the end of the training loop. When the trainer is set to run on GPUs other than gpu:0, running torch.cuda.empty_cache() still allocates memory on gpu:0. Apparently this is the default device's CUDA context being initialized, but it can be avoided. For example:

with torch.cuda.device('cuda:1'):
    torch.cuda.empty_cache()

If the cache is emptied this way, no memory is allocated on any GPU other than the one specified.
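
For illustration, a minimal sketch of that workaround wrapped in a helper (empty_cache_on is a name I made up for this sketch, not a Lightning or PyTorch API):

import torch

def empty_cache_on(device_index: int) -> None:
    # Switch the current CUDA device for the duration of the call, so that
    # empty_cache() does not create a context (and allocate memory) on cuda:0.
    if torch.cuda.is_available():
        with torch.cuda.device(device_index):
            torch.cuda.empty_cache()

empty_cache_on(1)  # only touches cuda:1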

This seems to be the same problem as #458, which was never resolved and is still present.

To Reproduce

Steps to reproduce the behavior:

  1. Create a pl.Trainer with gpus=[1]
  2. Fit a model on gpu:1
  3. torch.cuda.empty_cache() runs in run_training_teardown at the end of the training loop
  4. nvidia-smi shows memory usage on gpu:0
  5. If gpu:0 already has high memory usage from another job, this throws a CUDA out-of-memory error:
.../pytorch_lightning/trainer/training_loop.py in run_training_teardown(self)
   1153             model = self.get_model()
   1154             model.cpu()
-> 1155             torch.cuda.empty_cache()
   1156 
   1157     def training_forward(self, batch, batch_idx, opt_idx, hiddens):

.../torch/cuda/memory.py in empty_cache()
     84     """
     85     if is_initialized():
---> 86         torch._C._cuda_emptyCache()
     87 
     88 

RuntimeError: CUDA error: out of memory

Code sample

trainer = Trainer(gpus=[1])
trainer.fit(task, train_dataloader, val_dataloader)
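
For completeness, a hedged sketch of how one might observe the unexpected allocation; task, train_dataloader and val_dataloader are placeholders from the snippet above (not defined here), and querying nvidia-smi is just one way to read per-GPU usage without touching CUDA from Python:

import subprocess
import pytorch_lightning as pl

def gpu_memory_used_mib():
    # Ask nvidia-smi for per-GPU memory use so the check itself does not
    # initialize a CUDA context in this process.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]
    )
    return [int(line) for line in out.decode().split()]

print("before:", gpu_memory_used_mib())
trainer = pl.Trainer(gpus=[1])
trainer.fit(task, train_dataloader, val_dataloader)  # placeholders from the report
print("after:", gpu_memory_used_mib())  # on affected versions, gpu:0 usage also grows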

Expected behavior

Only gpu:1 should be used when training this model.

Environment

  • CUDA:
    • GPU:
      • GeForce RTX 2080 Ti
      • GeForce GTX 1080 Ti
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.18.1
    • pyTorch_debug: False
    • pyTorch_version: 1.5.0
    • pytorch-lightning: 0.9.0rc12
    • tensorboard: 2.2.1
    • tqdm: 4.46.0
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor: x86_64
    • python: 3.8.3
    • version: 18.04.1-Ubuntu
kmistry-wx added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Aug 17, 2020
@github-actions (Contributor)

Hi! Thanks for your contribution, great first issue!

@nateraw (Contributor) commented Aug 17, 2020

Mind submitting a PR for this?

@justusschock (Member)

I remember there was a bug in PyTorch that always created a CUDA context (and thus allocated memory) on the default GPU, which is the first one. Not sure whether that is also the issue here, though.
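
As an aside (my assumption, not something verified in this thread), hiding the other GPUs from the process before CUDA is initialized rules that default-device context out entirely:

import os
# Must be set before anything in this process initializes the CUDA driver;
# after this, the only visible GPU is re-indexed as cuda:0 inside the process.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch
print(torch.cuda.device_count())  # 1 -- no context can ever be created on physical gpu:0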

@williamFalcon (Contributor)

Yeah, I think this is a PyTorch bug... will investigate after 0.9.

williamFalcon added this to the 1.0.0 milestone on Aug 19, 2020
edenlightning modified the milestones: 1.0.0 → 0.9.x on Sep 1, 2020
@edenlightning (Contributor)

@kmistry-wx does the issue persist?

@kmistry-wx (Author)

@edenlightning After upgrading (PyTorch 1.5.0 -> 1.6.0, PyTorch Lightning 0.9.0rc12 -> 0.9.0), I can no longer reproduce this. A trainer run on gpu:1 now seems to use resources only from gpu:1.

@D-X-Y commented Mar 30, 2021

I'm using 1.8.0 on Linux with CUDA 11.2 and I'm also seeing this problem. Everything in my model is on GPU:1, but when I call torch.cuda.empty_cache(), it increases memory usage on GPU:0.

@xingqian2018

The same issue has been found on 1.6.0.

@YunruiZhang

Same issue found on 1.13.1
