Stop after validation sanity checking. #20682

Open
Uzukidd opened this issue Mar 28, 2025 · 1 comment
Labels
bug (Something isn't working), needs triage (Waiting to be triaged by maintainers), ver: 2.5.x

Comments

Uzukidd commented Mar 28, 2025

Bug description

I configured a full validation epoch before the first training epoch by setting num_sanity_val_steps: -1, running on 4 RTX 4090 GPUs. But the program just hangs after the validation epoch ends, while GPU utilization stays at 100%.
The only way I could terminate the process was pkill -9 python (I also tried Ctrl+C to interrupt it, but that does not work).
Image
Image
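For readers trying to reproduce this, the Trainer settings described above might look roughly like the sketch below. Only num_sanity_val_steps: -1 and the 4 GPUs come from the report; the accelerator value and the helper names sanity_check_trainer_kwargs and build_trainer are assumptions made for illustration, and the report does not show the actual Trainer configuration.

```python
# Hypothetical sketch of the Trainer settings described in the report.
# Only num_sanity_val_steps=-1 and the GPU count are taken from the report;
# everything else here is an assumption.


def sanity_check_trainer_kwargs(num_gpus: int = 4) -> dict:
    """Return Trainer keyword arguments that run the full validation set as the sanity check."""
    return {
        "accelerator": "gpu",  # assumption: the report only says "4 RTX4090s"
        "devices": num_gpus,
        # -1 runs every batch of every validation dataloader before training starts
        "num_sanity_val_steps": -1,
    }


def build_trainer(**overrides):
    """Create a Trainer; Lightning is imported lazily so the kwargs can be inspected without it."""
    import lightning as L

    kwargs = {**sanity_check_trainer_kwargs(), **overrides}
    return L.Trainer(**kwargs)
```

Calling build_trainer().fit(model, datamodule=...) with the datamodule from this report would then run the whole validation set once before the first training batch.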
I also tried setting a trace (breakpoint) after the validation sanity check. It seems that an error occurs while calling self.fit_loop.run(), but it goes uncaught.
Image
I have pasted the code of my datamodule below.
Please tell me if any further information is needed!

What version are you seeing the problem on?

v2.5

How to reproduce the bug

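from typing import List

import lightning as L
from torch.utils.data import DataLoader

from pcdet.datasets import DatasetTemplate

# convert_to_easydict, print_logger and voxel_collate_batch are project-specific
# helpers whose definitions are not included in this report.

class pcdet_dataset(L.LightningDataModule):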
    def __init__(self, pcdet_dataset_config:dict, 
                 class_names: List[str],
                 batch_size:int, 
                 dist_train:bool, 
                 workers:int,
                 merge_all_iters_to_one_epoch:bool,
                 total_epochs:int,
                ):
        super().__init__()
        self.pcdet_dataset_config = convert_to_easydict(pcdet_dataset_config)
        
        self.class_names = class_names
        self.dataset = None
        self.batch_size = batch_size
        self.dist_train = dist_train
        self.workers = workers
        self.merge_all_iters_to_one_epoch = merge_all_iters_to_one_epoch
        self.total_epochs = total_epochs
    
    def setup(self, stage: str) -> None:
        from pcdet.datasets import __all__
        print(f"STAGE: {stage}")
        if self.dataset is not None:
            return
        
        if stage in ("fit", "validate"):
            self.dataset:DatasetTemplate = __all__[self.pcdet_dataset_config.DATASET](
                dataset_cfg=self.pcdet_dataset_config,
                class_names=self.class_names,
                root_path=None,
                training=False,
                logger=print_logger(),
            )
        elif stage == "test":
            self.dataset:DatasetTemplate = __all__[self.pcdet_dataset_config.DATASET](
                dataset_cfg=self.pcdet_dataset_config,
                class_names=self.class_names,
                root_path=None,
                training=False,
                logger=print_logger(),
            )

    def build_dataloader(self) -> DataLoader:
        # The same dataset (built with training=False) backs every loader.
        # Note: self.workers is not passed as num_workers here.
        return DataLoader(self.dataset,
                          batch_size=self.batch_size,
                          shuffle=False,
                          collate_fn=voxel_collate_batch,)

    def train_dataloader(self) -> DataLoader:
        return self.build_dataloader()

    def test_dataloader(self) -> DataLoader:
        return self.build_dataloader()

    def val_dataloader(self) -> DataLoader:
        return self.build_dataloader()

Error messages and logs

-> self.state.stage = stage
(Pdb) pp stage
<RunningStage.TRAINING: 'train'>
(Pdb) l
1091             # reset the progress tracking state after sanity checking. we don't need to set the state before
1092             # because sanity check only runs when we are not restarting
1093             _reset_progress(val_loop)
1094
1095             # restore the previous stage when the sanity check if finished
1096  ->         self.state.stage = stage
1097
1098     def __setup_profiler(self) -> None:
1099         assert self.state.fn is not None
1100         local_rank = self.local_rank if self.world_size > 1 else None
1101         self.profiler._lightning_module = proxy(self.lightning_module)
(Pdb) n

/home/ksas/anaconda3/envs/physical_attack/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py(1053)_run_stage()
-> with isolate_rng():
(Pdb) n
/home/ksas/anaconda3/envs/physical_attack/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py(1055)_run_stage()
-> with torch.autograd.set_detect_anomaly(self._detect_anomaly):
(Pdb) n
/home/ksas/anaconda3/envs/physical_attack/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py(1056)_run_stage()
-> self.fit_loop.run()
(Pdb) n
/home/ksas/anaconda3/envs/physical_attack/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument to `num_workers=31` in the `DataLoader` to improve performance.

------ the process stops forever here --------
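Since the process neither returns nor reacts to Ctrl+C at this point, one way to see where each rank is stuck is Python's standard-library faulthandler module. The following is a minimal sketch, not something from the report: the helper names install_hang_diagnostics and remove_hang_diagnostics and the 600-second default are assumptions. Note that faulthandler prints Python-level frames only, so a hang inside a native call (for example a collective operation) shows up as the Python line that made the call.

```python
import faulthandler
import signal
import sys


def install_hang_diagnostics(timeout_s: float = 600.0, file=None) -> None:
    """Dump the tracebacks of all threads on SIGUSR1 and every `timeout_s` seconds.

    `file` must be a real file object with a file descriptor and must stay open.
    """
    out = file if file is not None else sys.stderr
    if hasattr(signal, "SIGUSR1"):  # SIGUSR1 is not available on Windows
        faulthandler.register(signal.SIGUSR1, file=out, all_threads=True)
    # The watchdog keeps firing until it is cancelled, so a hang is reported repeatedly.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=out)


def remove_hang_diagnostics() -> None:
    """Undo install_hang_diagnostics()."""
    faulthandler.cancel_dump_traceback_later()
    if hasattr(signal, "SIGUSR1"):
        faulthandler.unregister(signal.SIGUSR1)
```

A possible usage is to call install_hang_diagnostics() at the top of the training script, before trainer.fit(...). Once the run hangs, sending kill -USR1 <pid> to each rank's process should print the Python stack of every thread without killing the process.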

Environment

Current environment
  • CUDA:
    - GPU:
      - NVIDIA GeForce RTX 4090
      - NVIDIA GeForce RTX 4090
      - NVIDIA GeForce RTX 4090
      - NVIDIA GeForce RTX 4090
    - available: True
    - version: 11.8
  • Lightning:
    - lightning: 2.5.1
    - lightning-utilities: 0.14.2
    - pytorch-lightning: 2.5.1
    - pytorch3d: 0.7.8
    - raytorch: 0.1.0+aeaaf25
    - torch: 2.6.0+cu118
    - torchaudio: 2.6.0+cu118
    - torchmetrics: 1.7.0
    - torchvision: 0.21.0+cu118
  • Packages:
    - addict: 2.4.0
    - aiofiles: 24.1.0
    - aiohappyeyeballs: 2.6.1
    - aiohttp: 3.11.14
    - aiosignal: 1.3.2
    - antlr4-python3-runtime: 4.9.3
    - anyio: 4.9.0
    - apptools: 5.3.0
    - argcomplete: 3.6.1
    - argon2-cffi: 23.1.0
    - argon2-cffi-bindings: 21.2.0
    - arrow: 1.3.0
    - asttokens: 3.0.0
    - async-lru: 2.0.5
    - attrs: 25.3.0
    - autocommand: 2.2.2
    - av: 14.2.0
    - av2: 0.3.4
    - babel: 2.17.0
    - backports.tarfile: 1.2.0
    - beautifulsoup4: 4.13.3
    - bleach: 6.2.0
    - blinker: 1.9.0
    - cachetools: 5.5.2
    - ccimport: 0.4.4
    - certifi: 2025.1.31
    - cffi: 1.17.1
    - charset-normalizer: 3.4.1
    - click: 8.1.8
    - colorlog: 6.9.0
    - comm: 0.2.2
    - configargparse: 1.7
    - configobj: 5.0.9
    - contourpy: 1.3.1
    - cumm-cu118: 0.7.11
    - cycler: 0.12.1
    - dash: 3.0.1
    - debugpy: 1.8.13
    - decorator: 5.2.1
    - defusedxml: 0.7.1
    - dependency-groups: 1.3.0
    - descartes: 1.1.0
    - distlib: 0.3.9
    - docstring-parser: 0.16
    - easydict: 1.13
    - envisage: 7.0.3
    - executing: 2.2.0
    - fastjsonschema: 2.21.1
    - filelock: 3.13.1
    - fire: 0.7.0
    - flask: 3.0.3
    - fonttools: 4.56.0
    - fqdn: 1.5.1
    - frozenlist: 1.5.0
    - fsspec: 2024.6.1
    - h11: 0.14.0
    - httpcore: 1.0.7
    - httpx: 0.28.1
    - idna: 3.10
    - imageio: 2.37.0
    - importlib-metadata: 8.6.1
    - importlib-resources: 6.5.2
    - inflect: 7.3.1
    - iopath: 0.1.10
    - ipykernel: 6.29.5
    - ipython: 9.0.2
    - ipython-pygments-lexers: 1.1.1
    - ipywidgets: 8.1.5
    - isoduration: 20.11.0
    - itsdangerous: 2.2.0
    - jaraco.collections: 5.1.0
    - jaraco.context: 5.3.0
    - jaraco.functools: 4.0.1
    - jaraco.text: 3.12.1
    - jedi: 0.19.2
    - jinja2: 3.1.4
    - joblib: 1.4.2
    - json5: 0.10.0
    - jsonargparse: 4.37.0
    - jsonpointer: 3.0.0
    - jsonschema: 4.23.0
    - jsonschema-specifications: 2024.10.1
    - jupyter: 1.1.1
    - jupyter-client: 8.6.3
    - jupyter-console: 6.6.3
    - jupyter-core: 5.7.2
    - jupyter-events: 0.12.0
    - jupyter-lsp: 2.2.5
    - jupyter-server: 2.15.0
    - jupyter-server-terminals: 0.5.3
    - jupyterlab: 4.3.6
    - jupyterlab-pygments: 0.3.0
    - jupyterlab-server: 2.27.3
    - jupyterlab-widgets: 3.0.13
    - kiwisolver: 1.4.8
    - kornia: 0.6.8
    - kornia-rs: 0.1.8
    - lark: 1.2.2
    - lazy-loader: 0.4
    - lightning: 2.5.1
    - lightning-utilities: 0.14.2
    - llvmlite: 0.44.0
    - markdown-it-py: 3.0.0
    - markupsafe: 2.1.5
    - matplotlib: 3.5.3
    - matplotlib-inline: 0.1.7
    - mayavi: 4.8.2
    - mdurl: 0.1.2
    - mistune: 3.1.3
    - more-itertools: 10.3.0
    - motmetrics: 1.4.0
    - mpmath: 1.3.0
    - multidict: 6.2.0
    - narwhals: 1.32.0
    - nbclient: 0.10.2
    - nbconvert: 7.16.6
    - nbformat: 5.10.4
    - nest-asyncio: 1.6.0
    - networkx: 3.3
    - ninja: 1.11.1.4
    - notebook: 7.3.3
    - notebook-shim: 0.2.4
    - nox: 2025.2.9
    - numba: 0.61.0
    - numpy: 1.26.4
    - nuscenes-devkit: 1.1.11
    - nvidia-cublas-cu11: 11.11.3.6
    - nvidia-cuda-cupti-cu11: 11.8.87
    - nvidia-cuda-nvrtc-cu11: 11.8.89
    - nvidia-cuda-runtime-cu11: 11.8.89
    - nvidia-cudnn-cu11: 9.1.0.70
    - nvidia-cufft-cu11: 10.9.0.58
    - nvidia-curand-cu11: 10.3.0.86
    - nvidia-cusolver-cu11: 11.4.1.48
    - nvidia-cusparse-cu11: 11.7.5.86
    - nvidia-nccl-cu11: 2.21.5
    - nvidia-nvtx-cu11: 11.8.86
    - omegaconf: 2.3.0
    - open3d: 0.19.0
    - opencv-python: 4.11.0.86
    - overrides: 7.7.0
    - packaging: 24.2
    - pandas: 2.2.3
    - pandocfilters: 1.5.1
    - parso: 0.8.4
    - pccm: 0.4.16
    - pcdet: 0.6.0+8caccce
    - pexpect: 4.9.0
    - pillow: 11.0.0
    - pip: 25.0.1
    - platformdirs: 4.3.7
    - plotly: 6.0.1
    - polars: 1.25.2
    - portalocker: 3.1.1
    - prometheus-client: 0.21.1
    - prompt-toolkit: 3.0.50
    - propcache: 0.3.0
    - protobuf: 6.30.1
    - psutil: 7.0.0
    - ptyprocess: 0.7.0
    - pure-eval: 0.2.3
    - pyarrow: 19.0.1
    - pybind11: 2.13.6
    - pycocotools: 2.0.8
    - pycparser: 2.22
    - pyface: 8.0.0
    - pygments: 2.19.1
    - pyparsing: 3.2.1
    - pyproj: 3.7.1
    - pyqt5-qt5: 5.15.16
    - pyqt5-sip: 12.17.0
    - pyquaternion: 0.9.9
    - python-dateutil: 2.9.0.post0
    - python-json-logger: 3.3.0
    - pytorch-lightning: 2.5.1
    - pytorch3d: 0.7.8
    - pytz: 2025.1
    - pyyaml: 6.0.2
    - pyzmq: 26.3.0
    - raytorch: 0.1.0+aeaaf25
    - referencing: 0.36.2
    - requests: 2.32.3
    - retrying: 1.3.4
    - rfc3339-validator: 0.1.4
    - rfc3986-validator: 0.1.1
    - rich: 13.9.4
    - rpds-py: 0.23.1
    - scikit-image: 0.25.2
    - scikit-learn: 1.6.1
    - scipy: 1.15.2
    - send2trash: 1.8.3
    - setuptools: 77.0.3
    - shapely: 1.8.5.post1
    - sharedarray: 3.2.4
    - six: 1.17.0
    - sniffio: 1.3.1
    - sort-vertices: 0.0.0
    - soupsieve: 2.6
    - spconv-cu118: 2.3.8
    - stack-data: 0.6.3
    - sympy: 1.13.1
    - tensorboardx: 2.6.2.2
    - termcolor: 2.5.0
    - terminado: 0.18.1
    - threadpoolctl: 3.6.0
    - tifffile: 2025.3.13
    - tinycss2: 1.4.0
    - tomli: 2.0.1
    - torch: 2.6.0+cu118
    - torchaudio: 2.6.0+cu118
    - torchmetrics: 1.7.0
    - torchvision: 0.21.0+cu118
    - tornado: 6.4.2
    - tqdm: 4.67.1
    - traitlets: 5.14.3
    - traits: 7.0.2
    - traitsui: 8.0.0
    - triton: 3.2.0
    - typeguard: 4.3.0
    - types-python-dateutil: 2.9.0.20241206
    - typeshed-client: 2.7.0
    - typing-extensions: 4.12.2
    - tzdata: 2025.1
    - universal-pathlib: 0.2.6
    - uri-template: 1.3.0
    - urllib3: 2.3.0
    - virtualenv: 20.29.3
    - voxel-ops: 0.0.0
    - vtk: 9.4.1
    - wcwidth: 0.2.13
    - webcolors: 24.11.1
    - webencodings: 0.5.1
    - websocket-client: 1.8.0
    - werkzeug: 3.0.6
    - wheel: 0.45.1
    - widgetsnbextension: 4.0.13
    - xmltodict: 0.14.2
    - yarl: 1.18.3
    - zipp: 3.21.0
  • System:
    - OS: Linux
    - architecture:
      - 64bit
      - ELF
    - processor: x86_64
    - python: 3.12.9
    - release: 6.8.0-52-generic
    - version: #53~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Jan 15 19:18:46 UTC 2

More info

No response

Uzukidd added the bug and needs triage labels Mar 28, 2025

stale bot commented Apr 28, 2025

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!
