-> self.state.stage = stage
(Pdb) pp stage
<RunningStage.TRAINING: 'train'>
(Pdb) l
1091 # reset the progress tracking state after sanity checking. we don't need to set the state before
1092 # because sanity check only runs when we are not restarting
1093 _reset_progress(val_loop)
1094
1095 # restore the previous stage when the sanity check if finished
1096 -> self.state.stage = stage
1097
1098 def __setup_profiler(self) -> None:
1099 assert self.state.fn is not None
1100 local_rank = self.local_rank if self.world_size > 1 else None
1101 self.profiler._lightning_module = proxy(self.lightning_module)
(Pdb) n
/home/ksas/anaconda3/envs/physical_attack/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py(1053)_run_stage()
-> with isolate_rng():
(Pdb) n
/home/ksas/anaconda3/envs/physical_attack/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py(1055)_run_stage()
-> with torch.autograd.set_detect_anomaly(self._detect_anomaly):
(Pdb) n
/home/ksas/anaconda3/envs/physical_attack/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py(1056)_run_stage()
-> self.fit_loop.run()
(Pdb) n
/home/ksas/anaconda3/envs/physical_attack/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument to `num_workers=31` in the `DataLoader` to improve performance.
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!
Bug description
I run a full validation epoch before the first training epoch by setting `num_sanity_val_steps: -1`, using 4 RTX 4090s. But the program just hangs after the validation epoch ends, while GPU utilization stays at 100%. The only thing I can do is run `pkill -9 python` to terminate the process (I even tried Ctrl+C to interrupt it, but that does not work).
I also tried setting a trace after the validation sanity check; it seems that some error occurs while calling `self.fit_loop.run()` but is never caught.
I paste the code of my datamodule here.
Please tell me if any further information is needed!
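A hang with GPUs pinned at 100% on a multi-GPU run is consistent with, though it does not prove, one rank waiting inside a collective operation. As a diagnostic aid only (not a fix), here is a minimal sketch that turns on verbose NCCL and `torch.distributed` logging through the `NCCL_DEBUG` and `TORCH_DISTRIBUTED_DEBUG` environment variables. The helper name `enable_distributed_debug_logging` is hypothetical, and the sketch assumes it is called at the very top of the training script, before the `Trainer` is created, so that worker processes inherit the variables.

```python
import os


def enable_distributed_debug_logging(env=None):
    """Request verbose NCCL / torch.distributed logs (hypothetical helper).

    Assumption: this runs before any process group is initialized;
    variables set later may be ignored by the already-initialized backend.
    """
    env = os.environ if env is None else env
    # setdefault keeps any value the user already exported in the shell.
    env.setdefault("NCCL_DEBUG", "INFO")
    env.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")
    return env
```

Calling `enable_distributed_debug_logging()` as the first statement of the entry script should make each rank print which collectives it enters, which may help show whether the ranks diverge after the sanity validation.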
What version are you seeing the problem on?
v2.5
How to reproduce the bug
Error messages and logs
-> self.state.stage = stage
(Pdb) pp stage
<RunningStage.TRAINING: 'train'>
(Pdb) l
1091 # reset the progress tracking state after sanity checking. we don't need to set the state before
1092 # because sanity check only runs when we are not restarting
1093 _reset_progress(val_loop)
1094
1095 # restore the previous stage when the sanity check if finished
1096 -> self.state.stage = stage
1097
1098 def __setup_profiler(self) -> None:
1099 assert self.state.fn is not None
1100 local_rank = self.local_rank if self.world_size > 1 else None
1101 self.profiler._lightning_module = proxy(self.lightning_module)
(Pdb) n
------ the process stops forever here --------
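Since the process stops responding to pdb and Ctrl+C at this point, one way to see where each rank is stuck is Python's standard `faulthandler` module, which can dump the stack of every thread on a timer or on a signal. The following is a minimal sketch, not part of Lightning: the helper names and the 600-second default are assumptions chosen for illustration, and the signal-based dump is only available on Unix.

```python
import faulthandler
import signal
import sys


def enable_hang_diagnostics(timeout_s=600.0, file=None):
    """Periodically dump all thread stacks so a hang leaves a trace (hypothetical helper)."""
    file = sys.stderr if file is None else file
    # On Unix, `kill -USR1 <pid>` dumps the stacks on demand.
    if hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, file=file, all_threads=True)
    # Dump the stacks every `timeout_s` seconds until cancelled.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file)


def disable_hang_diagnostics():
    """Stop the periodic dump and remove the signal handler again."""
    faulthandler.cancel_dump_traceback_later()
    if hasattr(signal, "SIGUSR1"):
        faulthandler.unregister(signal.SIGUSR1)
```

Calling `enable_hang_diagnostics()` at the start of the training script means that, if the run freezes after the sanity validation, each process writes its Python stack to stderr without needing an attached debugger. The dump shows Python frames only, so a hang inside a CUDA or NCCL call appears as the last Python line that issued it.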
Environment
Current environment
- GPU:
- NVIDIA GeForce RTX 4090
- NVIDIA GeForce RTX 4090
- NVIDIA GeForce RTX 4090
- NVIDIA GeForce RTX 4090
- CUDA available: True
- CUDA version: 11.8
- lightning: 2.5.1
- lightning-utilities: 0.14.2
- pytorch-lightning: 2.5.1
- pytorch3d: 0.7.8
- raytorch: 0.1.0+aeaaf25
- torch: 2.6.0+cu118
- torchaudio: 2.6.0+cu118
- torchmetrics: 1.7.0
- torchvision: 0.21.0+cu118
- addict: 2.4.0
- aiofiles: 24.1.0
- aiohappyeyeballs: 2.6.1
- aiohttp: 3.11.14
- aiosignal: 1.3.2
- antlr4-python3-runtime: 4.9.3
- anyio: 4.9.0
- apptools: 5.3.0
- argcomplete: 3.6.1
- argon2-cffi: 23.1.0
- argon2-cffi-bindings: 21.2.0
- arrow: 1.3.0
- asttokens: 3.0.0
- async-lru: 2.0.5
- attrs: 25.3.0
- autocommand: 2.2.2
- av: 14.2.0
- av2: 0.3.4
- babel: 2.17.0
- backports.tarfile: 1.2.0
- beautifulsoup4: 4.13.3
- bleach: 6.2.0
- blinker: 1.9.0
- cachetools: 5.5.2
- ccimport: 0.4.4
- certifi: 2025.1.31
- cffi: 1.17.1
- charset-normalizer: 3.4.1
- click: 8.1.8
- colorlog: 6.9.0
- comm: 0.2.2
- configargparse: 1.7
- configobj: 5.0.9
- contourpy: 1.3.1
- cumm-cu118: 0.7.11
- cycler: 0.12.1
- dash: 3.0.1
- debugpy: 1.8.13
- decorator: 5.2.1
- defusedxml: 0.7.1
- dependency-groups: 1.3.0
- descartes: 1.1.0
- distlib: 0.3.9
- docstring-parser: 0.16
- easydict: 1.13
- envisage: 7.0.3
- executing: 2.2.0
- fastjsonschema: 2.21.1
- filelock: 3.13.1
- fire: 0.7.0
- flask: 3.0.3
- fonttools: 4.56.0
- fqdn: 1.5.1
- frozenlist: 1.5.0
- fsspec: 2024.6.1
- h11: 0.14.0
- httpcore: 1.0.7
- httpx: 0.28.1
- idna: 3.10
- imageio: 2.37.0
- importlib-metadata: 8.6.1
- importlib-resources: 6.5.2
- inflect: 7.3.1
- iopath: 0.1.10
- ipykernel: 6.29.5
- ipython: 9.0.2
- ipython-pygments-lexers: 1.1.1
- ipywidgets: 8.1.5
- isoduration: 20.11.0
- itsdangerous: 2.2.0
- jaraco.collections: 5.1.0
- jaraco.context: 5.3.0
- jaraco.functools: 4.0.1
- jaraco.text: 3.12.1
- jedi: 0.19.2
- jinja2: 3.1.4
- joblib: 1.4.2
- json5: 0.10.0
- jsonargparse: 4.37.0
- jsonpointer: 3.0.0
- jsonschema: 4.23.0
- jsonschema-specifications: 2024.10.1
- jupyter: 1.1.1
- jupyter-client: 8.6.3
- jupyter-console: 6.6.3
- jupyter-core: 5.7.2
- jupyter-events: 0.12.0
- jupyter-lsp: 2.2.5
- jupyter-server: 2.15.0
- jupyter-server-terminals: 0.5.3
- jupyterlab: 4.3.6
- jupyterlab-pygments: 0.3.0
- jupyterlab-server: 2.27.3
- jupyterlab-widgets: 3.0.13
- kiwisolver: 1.4.8
- kornia: 0.6.8
- kornia-rs: 0.1.8
- lark: 1.2.2
- lazy-loader: 0.4
- lightning: 2.5.1
- lightning-utilities: 0.14.2
- llvmlite: 0.44.0
- markdown-it-py: 3.0.0
- markupsafe: 2.1.5
- matplotlib: 3.5.3
- matplotlib-inline: 0.1.7
- mayavi: 4.8.2
- mdurl: 0.1.2
- mistune: 3.1.3
- more-itertools: 10.3.0
- motmetrics: 1.4.0
- mpmath: 1.3.0
- multidict: 6.2.0
- narwhals: 1.32.0
- nbclient: 0.10.2
- nbconvert: 7.16.6
- nbformat: 5.10.4
- nest-asyncio: 1.6.0
- networkx: 3.3
- ninja: 1.11.1.4
- notebook: 7.3.3
- notebook-shim: 0.2.4
- nox: 2025.2.9
- numba: 0.61.0
- numpy: 1.26.4
- nuscenes-devkit: 1.1.11
- nvidia-cublas-cu11: 11.11.3.6
- nvidia-cuda-cupti-cu11: 11.8.87
- nvidia-cuda-nvrtc-cu11: 11.8.89
- nvidia-cuda-runtime-cu11: 11.8.89
- nvidia-cudnn-cu11: 9.1.0.70
- nvidia-cufft-cu11: 10.9.0.58
- nvidia-curand-cu11: 10.3.0.86
- nvidia-cusolver-cu11: 11.4.1.48
- nvidia-cusparse-cu11: 11.7.5.86
- nvidia-nccl-cu11: 2.21.5
- nvidia-nvtx-cu11: 11.8.86
- omegaconf: 2.3.0
- open3d: 0.19.0
- opencv-python: 4.11.0.86
- overrides: 7.7.0
- packaging: 24.2
- pandas: 2.2.3
- pandocfilters: 1.5.1
- parso: 0.8.4
- pccm: 0.4.16
- pcdet: 0.6.0+8caccce
- pexpect: 4.9.0
- pillow: 11.0.0
- pip: 25.0.1
- platformdirs: 4.3.7
- plotly: 6.0.1
- polars: 1.25.2
- portalocker: 3.1.1
- prometheus-client: 0.21.1
- prompt-toolkit: 3.0.50
- propcache: 0.3.0
- protobuf: 6.30.1
- psutil: 7.0.0
- ptyprocess: 0.7.0
- pure-eval: 0.2.3
- pyarrow: 19.0.1
- pybind11: 2.13.6
- pycocotools: 2.0.8
- pycparser: 2.22
- pyface: 8.0.0
- pygments: 2.19.1
- pyparsing: 3.2.1
- pyproj: 3.7.1
- pyqt5-qt5: 5.15.16
- pyqt5-sip: 12.17.0
- pyquaternion: 0.9.9
- python-dateutil: 2.9.0.post0
- python-json-logger: 3.3.0
- pytorch-lightning: 2.5.1
- pytorch3d: 0.7.8
- pytz: 2025.1
- pyyaml: 6.0.2
- pyzmq: 26.3.0
- raytorch: 0.1.0+aeaaf25
- referencing: 0.36.2
- requests: 2.32.3
- retrying: 1.3.4
- rfc3339-validator: 0.1.4
- rfc3986-validator: 0.1.1
- rich: 13.9.4
- rpds-py: 0.23.1
- scikit-image: 0.25.2
- scikit-learn: 1.6.1
- scipy: 1.15.2
- send2trash: 1.8.3
- setuptools: 77.0.3
- shapely: 1.8.5.post1
- sharedarray: 3.2.4
- six: 1.17.0
- sniffio: 1.3.1
- sort-vertices: 0.0.0
- soupsieve: 2.6
- spconv-cu118: 2.3.8
- stack-data: 0.6.3
- sympy: 1.13.1
- tensorboardx: 2.6.2.2
- termcolor: 2.5.0
- terminado: 0.18.1
- threadpoolctl: 3.6.0
- tifffile: 2025.3.13
- tinycss2: 1.4.0
- tomli: 2.0.1
- torch: 2.6.0+cu118
- torchaudio: 2.6.0+cu118
- torchmetrics: 1.7.0
- torchvision: 0.21.0+cu118
- tornado: 6.4.2
- tqdm: 4.67.1
- traitlets: 5.14.3
- traits: 7.0.2
- traitsui: 8.0.0
- triton: 3.2.0
- typeguard: 4.3.0
- types-python-dateutil: 2.9.0.20241206
- typeshed-client: 2.7.0
- typing-extensions: 4.12.2
- tzdata: 2025.1
- universal-pathlib: 0.2.6
- uri-template: 1.3.0
- urllib3: 2.3.0
- virtualenv: 20.29.3
- voxel-ops: 0.0.0
- vtk: 9.4.1
- wcwidth: 0.2.13
- webcolors: 24.11.1
- webencodings: 0.5.1
- websocket-client: 1.8.0
- werkzeug: 3.0.6
- wheel: 0.45.1
- widgetsnbextension: 4.0.13
- xmltodict: 0.14.2
- yarl: 1.18.3
- zipp: 3.21.0
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.12.9
- release: 6.8.0-52-generic
- version: #53~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Jan 15 19:18:46 UTC 2
More info
No response