Memory leaking when using large numpy array in Dataset #1761
Comments
Hi! Thanks for your contribution, and great first issue!
I had a similar issue, and if I recall correctly, defining the environment variable COLAB_GPU forces PyTorch Lightning to use fork, which might prevent this Nx memory blowup.
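For reference, a minimal sketch of the env-var approach (assuming the variable just needs to be set before the Trainer is constructed; whether it has any effect outside a Colab/TPU runtime is exactly what the reply below questions):

```python
import os

# Hypothetical: set the variable before constructing the Trainer so that
# Lightning's process-launching code sees it. Whether this actually forces
# fork outside of a Colab/TPU runtime is unclear.
os.environ["COLAB_GPU"] = "1"

from pytorch_lightning import Trainer  # 0.7.x-era import

# trainer = Trainer(gpus=4, distributed_backend="ddp")
# trainer.fit(model)
```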
Thank you for the answer, but it seems that option only applies to TPU training? I tried it anyway, but it didn't improve my situation. Any other pointers or ideas?
I tried to manually rewrite the PyTorch Lightning code to use fork instead of spawn, but then the error "Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method" comes up.
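For context, the error comes from CUDA's restriction that a process which has already initialized CUDA cannot fork and then use CUDA in the child; a spawned child starts from a fresh interpreter instead. A small illustration (not Lightning's internal code):

```python
import torch
import torch.multiprocessing as mp

def worker(rank):
    # In a forked child, this call raises "Cannot re-initialize CUDA in
    # forked subprocess" if the parent process already touched CUDA.
    # A spawned child initializes CUDA from scratch, so it works, but the
    # dataset then has to be pickled into every child process.
    torch.cuda.set_device(rank)
    print(f"worker {rank} using device {torch.cuda.current_device()}")

if __name__ == "__main__":
    # torch.multiprocessing.spawn always uses the "spawn" start method.
    mp.spawn(worker, nprocs=2)
```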
Might be similar to #1769
So for others running into this: as a workaround, setting it to a higher limit helped. However, it still seems too slow; it's only half as fast as it was using Torchbearer before. The more num_workers I use in the DataLoader, the slower the start of an epoch, similar to what is described in this issue: #1884
@mpaepper check again?
Yes, thank you. It's resolved with the recent master additions 👍 |
🐛 Bug
Thank you for the great library! While migrating a larger project, though, I am running into memory issues, so maybe someone can help me out.
I have a pretty complicated Dataset which loads a lot of data and buffers it in CPU RAM as a numpy array.
I train using ddp with num_workers = 6 in the DataLoader. Training crashes my machine because of CPU memory overflow. It works with num_workers = 0, but the higher num_workers is, the higher the memory consumption.
I figured out that this is much worse when using a large numpy array in the Dataset rather than a PyTorch tensor.
Unfortunately, I need numpy arrays, so is there anything I can do?
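Roughly, the pattern looks like this (class name and shapes are made up for illustration; the real project is more complicated):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

class BufferedNumpyDataset(Dataset):
    """Hypothetical stand-in: buffers everything into CPU RAM as one numpy array."""

    def __init__(self, n_samples=100_000, n_features=1_000):
        # One big float32 buffer, roughly n_samples * n_features * 4 bytes.
        self.data = np.random.rand(n_samples, n_features).astype(np.float32)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Per-item conversion; the whole buffer stays resident in every worker.
        return torch.from_numpy(self.data[idx])

# Used with ddp and several workers, as in the report:
loader = DataLoader(BufferedNumpyDataset(), batch_size=64, num_workers=6)
```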
To Reproduce
I created a repository to reproduce this. It allows you to train a model on toy data using either a PyTorch tensor or a numpy array in the Dataset.
When running it with the PyTorch tensor, the same amount of data uses 5 GB of RAM, while with numpy it uses more than 30 GB.
The higher num_workers is, the higher the RAM usage; it seems to leak when using numpy?
Code sample
https://github.com/mpaepper/reproduce_pytorch_lightning_memory_issues
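The repository has the full code; a quick way to watch memory grow without it is to sum the resident set size of the main process and its DataLoader workers while iterating. psutil and the toy dataset below are illustrative, not the repository's actual code:

```python
import os

import numpy as np
import psutil
import torch
from torch.utils.data import DataLoader, Dataset

class ToyNumpyDataset(Dataset):
    # Illustrative numpy-backed dataset (~0.8 GB buffer).
    def __init__(self):
        self.data = np.random.rand(200_000, 1_000).astype(np.float32)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return torch.from_numpy(self.data[idx])

def total_rss_gb():
    # Resident memory of this process plus all child (worker) processes.
    procs = [psutil.Process(os.getpid())]
    procs += procs[0].children(recursive=True)
    return sum(p.memory_info().rss for p in procs) / 1e9

if __name__ == "__main__":
    loader = DataLoader(ToyNumpyDataset(), batch_size=64, num_workers=6)
    for i, _ in enumerate(loader):
        if i % 500 == 0:
            print(f"step {i}: total RSS ~ {total_rss_gb():.1f} GB")
```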
Expected behavior
I would expect numpy arrays and PyTorch tensors to behave the same way when using num_workers > 0, i.e. memory consumption should be similar.
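One mitigation that is commonly suggested for this kind of blow-up (not confirmed as the fix here) is to keep the buffer in a shared-memory torch tensor and return numpy views per item, so worker processes map the same pages instead of each holding its own copy. A hedged sketch:

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class SharedBufferDataset(Dataset):
    """Hypothetical workaround: store the buffer as a shared torch tensor,
    but keep returning numpy arrays so downstream numpy code still works."""

    def __init__(self, array: np.ndarray):
        # torch.from_numpy wraps the buffer without copying; share_memory_()
        # moves the storage into shared memory so DataLoader/DDP workers can
        # map it instead of receiving their own pickled copy.
        self.buffer = torch.from_numpy(array).share_memory_()

    def __len__(self):
        return self.buffer.shape[0]

    def __getitem__(self, idx):
        # .numpy() returns a view onto the shared storage; no copy is made.
        return self.buffer[idx].numpy()
```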
Environment
- CUDA:
  - GPU:
    - GeForce RTX 2080 Ti
    - GeForce RTX 2080 Ti
    - GeForce RTX 2080 Ti
    - GeForce RTX 2080 Ti
  - available: True
  - version: 10.1
- Packages:
  - numpy: 1.16.4
  - pyTorch_debug: False
  - pyTorch_version: 1.4.0
  - pytorch-lightning: 0.7.5
  - tensorboard: 1.14.0
  - tqdm: 4.46.0
- System:
  - OS: Linux
  - architecture: 64bit
  - processor: x86_64
  - python: 3.7.3
  - version: #97-Ubuntu SMP Wed Apr 1 03:25:46 UTC 2020