CUDA error: an illegal memory access was encountered after updating to the latest stable packages #2085
Seems to be from calling `.to("cuda")`. I think I'm using the latest PyTorch Lightning: is there anything I can do?
Try it without 16-bit? Or use native AMP?
I don't think I'm using 16-bit. My trainer is:
Any ideas?
Are you on 0.8.0rc1? Which distributed mode are you using? Try `ddp_spawn`?
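For reference, a minimal sketch of what that would look like with the 0.8.x Trainer API (the module class and GPU count here are placeholders, not taken from this thread):

```python
# Minimal sketch: confirm the installed Lightning version and switch the
# distributed mode to ddp_spawn, keeping full precision to rule out 16-bit.
import pytorch_lightning as pl

print(pl.__version__)  # should report 0.8.0rc1 after upgrading

trainer = pl.Trainer(
    gpus=2,                           # placeholder GPU count
    distributed_backend="ddp_spawn",  # 0.8.x name of the distributed-mode argument
    precision=32,                     # full precision, i.e. no 16-bit / AMP
)
# trainer.fit(MyLightningModule())    # MyLightningModule is a hypothetical LightningModule
```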
My Lightning version is 0.7.6: how can I update? I'm using the default dist mode. I tried:
They are on CPU. Is this normal?
It is already on PyPI, so just upgrade with pip.
Thanks! Using 0.8.0rc1 and/or ddp_spawn does not help :(
Can you share a small snippet we can use to reproduce?
I'm also having this issue, but it seems to happen randomly, so it's difficult for me to provide a small snippet for reproduction.
This seems to be related to mixing apex and CUDA somehow.
I'm having better success with PyTorch 1.6 (nightly); I recommend trying that. I have apex installed but haven't set the Trainer to use AMP; maybe it could still be related?
@brucemuller Any luck on this? I have run into the same issue, even using the nightly build of torch. It seems to be related to memory usage; lowering my batch size helps, but I'm not sure why yet. It chugs along fine with plenty of memory on the GPU (4 GB of 16 GB used), and then at the end of a batch it suddenly fails with this error.
This may be a clue. I also encountered this error and had the same stack trace at the bottom (same last frame). When apex initializes, it creates a dummy tensor on the default CUDA device, which is cuda:0. The workaround for me is to set the CUDA_VISIBLE_DEVICES environment variable.
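A sketch of that workaround for illustration, assuming you want to expose only physical GPU 1; the key point is that the variable has to be set before anything initializes CUDA:

```python
# Restrict the visible GPUs before torch (and apex) initialize CUDA, so the
# "default" device that apex touches is the one you actually intend to use.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # example: expose only physical GPU 1

import torch  # imported after setting the variable so device enumeration respects it

print(torch.cuda.device_count())    # 1 -- only the selected GPU is visible
print(torch.cuda.current_device())  # 0 -- the visible GPU is re-indexed as cuda:0
```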
Could you try with 0.9? We set the CUDA flag for you automatically there.
I can still reproduce the error with 0.9.0. I paused at the point where it fails: the env variable is correctly set, but torch can still see 2 devices, and then the error is raised. Setting it outside the script works.
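That is consistent with how the variable is consumed: the CUDA runtime reads it once, when it first initializes, so setting it after anything has touched CUDA has no effect. A small illustration, assuming a 2-GPU machine:

```python
import os
import torch

# The first CUDA query initializes device enumeration and locks in whatever
# CUDA_VISIBLE_DEVICES was at that moment (here: unset, so both GPUs).
print(torch.cuda.device_count())  # e.g. 2

# Changing the variable now is silently ignored for this process...
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
print(torch.cuda.device_count())  # still 2 -- it only matters before CUDA init
```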
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
@binshengliu Hi, any updates on this? I ran into the same issue with apex DDP training with fp16 enabled. In my case it is very obvious that apex caused the problem; see the following code snippet.
The illegal memory access error is triggered there. Do you have any further suggestions? Thanks in advance!
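Since the referenced snippet did not survive, here is a generic sketch of the apex + DDP pattern that is commonly recommended for this failure: pin each process to its GPU before `amp.initialize`, so apex's internal tensors are created on the right device. The `local_rank` argument and the `O2` opt level are assumptions, not taken from this thread:

```python
import torch
from apex import amp
from apex.parallel import DistributedDataParallel as ApexDDP

def wrap_for_fp16_ddp(model, optimizer, local_rank):
    # Make this process's GPU the default device *before* apex touches CUDA,
    # otherwise its dummy/scratch tensors land on cuda:0.
    torch.cuda.set_device(local_rank)
    model = model.cuda(local_rank)
    # O2 is only an example opt level; O1 is the more conservative choice.
    model, optimizer = amp.initialize(model, optimizer, opt_level="O2")
    return ApexDDP(model), optimizer
```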
I encountered this issue when I didn't use GPU 0. My workaround was to specify GPUs through the environment variable.
Other than that, I have no idea. Maybe try using
Can anyone help with this "CUDA error: an illegal memory access was encountered"?
It runs fine for several iterations...
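One general debugging step for this error (not suggested by anyone above, just standard practice): CUDA kernels launch asynchronously, so the Python traceback for an illegal memory access often points at an unrelated line; forcing synchronous launches makes the operation that actually faulted show up in the traceback:

```python
# Force synchronous CUDA kernel launches so the traceback points at the real
# failing op. Set it before CUDA is initialized; it slows training, so use it
# only while debugging.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import torch only after setting the variable
```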
🐛 Bug
To Reproduce
Environment
- CUDA:
  - GPU:
    - Quadro P6000
  - available: True
  - version: 10.2
- Packages:
  - numpy: 1.18.1
  - pyTorch_debug: False
  - pyTorch_version: 1.5.0
  - pytorch-lightning: 0.7.6
  - tensorboard: 2.2.2
  - tqdm: 4.46.1
- System:
  - OS: Linux
  - architecture:
    - 64bit
  - processor: x86_64
  - python: 3.7.0
  - version: #47~18.04.1-Ubuntu SMP Thu May 7 13:10:50 UTC 2020