CUDA error: an illegal memory access was encountered after updating to the latest stable packages #2085
Seems to be from calling `.to("cuda")`. I think I'm using the latest PyTorch Lightning: is there anything I can do?
Try it without 16-bit? Or use native AMP?
I don't think I'm using 16-bit. My trainer is:
Any ideas?
Are you on 0.8.0rc1? Which distributed mode are you using? Try `ddp_spawn`?
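For reference, a minimal sketch of what that would look like with the 0.8.x Trainer API (the module class and GPU count here are placeholders, not taken from this thread):

```python
# Minimal sketch: confirm the installed Lightning version and switch the
# distributed mode to ddp_spawn, keeping full precision to rule out 16-bit.
import pytorch_lightning as pl

print(pl.__version__)  # should report 0.8.0rc1 after upgrading

trainer = pl.Trainer(
    gpus=2,                           # placeholder GPU count
    distributed_backend="ddp_spawn",  # 0.8.x name of the distributed-mode argument
    precision=32,                     # full precision, i.e. no 16-bit / AMP
)
# trainer.fit(MyLightningModule())    # MyLightningModule is a hypothetical LightningModule
```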
My Lightning version is 0.7.6: how can I update? I'm using the default dist mode. I tried:
They are on CPU. Is this normal?
It is already on PyPI, so just upgrade with pip.
Thanks! Using 0.8.0rc1 and/or ddp_spawn does not help :(
Can you share a small snippet we can use to reproduce?
I'm also having this issue, but it seems to happen randomly, so it's difficult for me to provide a small snippet for reproduction.
This seems to be related to mixing apex and CUDA somehow.
I'm having better success with PyTorch 1.6 (nightly); I recommend trying that. I have apex installed but haven't set the Trainer to use AMP; maybe it could still be related?
@brucemuller Any luck on this? I have run into the same issue, even using the nightly build of torch. It seems to be related to memory usage; lowering my batch size helps, but I'm not sure why yet. It chugs along fine with plenty of memory on the GPU (4 GB of 16 GB used), and then at the end of a batch it suddenly fails with this error.
This may be a clue. I also encountered this error and had the same stack trace at the bottom (same last frame). When apex initializes, it creates a dummy tensor on the default CUDA device, which is cuda:0. The workaround for me is to set the CUDA_VISIBLE_DEVICES environment variable.
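A sketch of that workaround for illustration, assuming you want to expose only physical GPU 1; the key point is that the variable has to be set before anything initializes CUDA:

```python
# Restrict the visible GPUs before torch (and apex) initialize CUDA, so the
# "default" device that apex touches is the one you actually intend to use.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # example: expose only physical GPU 1

import torch  # imported after setting the variable so device enumeration respects it

print(torch.cuda.device_count())    # 1 -- only the selected GPU is visible
print(torch.cuda.current_device())  # 0 -- the visible GPU is re-indexed as cuda:0
```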
Could you try with 0.9? We set the CUDA flag for you automatically there.
I can still reproduce the error with 0.9.0. I paused at the point where it fails: the env variable is correctly set, but torch can still see 2 devices, and then the error is raised. Setting it outside the script works.
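That is consistent with how the variable is consumed: the CUDA runtime reads it once, when it first initializes, so setting it after anything has touched CUDA has no effect. A small illustration, assuming a 2-GPU machine:

```python
import os
import torch

# The first CUDA query initializes device enumeration and locks in whatever
# CUDA_VISIBLE_DEVICES was at that moment (here: unset, so both GPUs).
print(torch.cuda.device_count())  # e.g. 2

# Changing the variable now is silently ignored for this process...
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
print(torch.cuda.device_count())  # still 2 -- it only matters before CUDA init
```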
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
@binshengliu Hi, any updates on this? I ran into the same issue with apex DDP training with fp16 enabled. In my case it is very obvious that apex caused the problem; see the following code snippet.
The illegal memory access error is triggered there. Do you have any further suggestions? Thanks in advance!
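Since the referenced snippet did not survive, here is a generic sketch of the apex + DDP pattern that is commonly recommended for this failure: pin each process to its GPU before `amp.initialize`, so apex's internal tensors are created on the right device. The `local_rank` argument and the `O2` opt level are assumptions, not taken from this thread:

```python
import torch
from apex import amp
from apex.parallel import DistributedDataParallel as ApexDDP

def wrap_for_fp16_ddp(model, optimizer, local_rank):
    # Make this process's GPU the default device *before* apex touches CUDA,
    # otherwise its dummy/scratch tensors land on cuda:0.
    torch.cuda.set_device(local_rank)
    model = model.cuda(local_rank)
    # O2 is only an example opt level; O1 is the more conservative choice.
    model, optimizer = amp.initialize(model, optimizer, opt_level="O2")
    return ApexDDP(model), optimizer
```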
I encountered this issue when I didn't use GPU 0. My workaround was to specify GPUs through the environment variable.
Other than that, I have no idea. Maybe try using
Can anyone help with this "CUDA error: an illegal memory access was encountered"?
It runs fine for several iterations...
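One general debugging step for this error (not suggested by anyone above, just standard practice): CUDA kernels launch asynchronously, so the Python traceback for an illegal memory access often points at an unrelated line; forcing synchronous launches makes the operation that actually faulted show up in the traceback:

```python
# Force synchronous CUDA kernel launches so the traceback points at the real
# failing op. Set it before CUDA is initialized; it slows training, so use it
# only while debugging.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import torch only after setting the variable
```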
🐛 Bug
To Reproduce
Environment
- CUDA:
  - GPU:
    - Quadro P6000
  - available: True
  - version: 10.2
- Packages:
  - numpy: 1.18.1
  - pyTorch_debug: False
  - pyTorch_version: 1.5.0
  - pytorch-lightning: 0.7.6
  - tensorboard: 2.2.2
  - tqdm: 4.46.1
- System:
  - OS: Linux
  - architecture:
    - 64bit
  - processor: x86_64
  - python: 3.7.0
  - version: #47~18.04.1-Ubuntu SMP Thu May 7 13:10:50 UTC 2020