Early stopping with val_check_interval and check_val_every_n_epoch stops too early #17736

Closed
noamsgl opened this issue Jun 1, 2023 · 1 comment
Labels: bug, needs triage, ver: 1.9.x


noamsgl commented Jun 1, 2023

Bug description

I am training with an iterable-style dataset and want to check validation performance every 1000 steps (minibatches). However, the EarlyStopping callback stops training much too early, even though the monitored metric is still improving.

The logs below show validation/auroc improving several times, yet the callback signals the Trainer to stop.

What am I doing wrong? Or where is the bug?
Thanks

What version are you seeing the problem on?

v1.9

How to reproduce the bug

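import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping

# Stop once validation/auroc has failed to improve by at least min_delta
# for `patience` consecutive validation checks.
early_stop_callback = EarlyStopping(
    monitor="validation/auroc",
    min_delta=1e-3,
    patience=4,
    verbose=True,
    mode="max",
)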

trainer = pl.Trainer(
    callbacks=[checkpoint_callback, early_stop_callback, lr_monitor],
    logger=[wandb_logger],
    accelerator="gpu",
    devices=1,
    num_nodes=self.num_nodes,
    strategy=strategy,
    auto_scale_batch_size=False,
    auto_lr_find=False,
    max_epochs=self.max_epochs,
    limit_train_batches=1.0,
    limit_val_batches=1.0,
    limit_test_batches=1.0,
    check_val_every_n_epoch=1,
    val_check_interval=1000,
    precision=16,
)
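
A note on how these settings interact, offered as a sketch rather than a diagnosis: the Lightning documentation states that `patience` counts validation checks without improvement, not training epochs. Assuming a check runs once every `val_check_interval` training batches (and ignoring any additional checks), the arithmetic looks like the following. The helper names are hypothetical and not part of Lightning.

```python
import math


def batches_covered(patience: int, val_check_interval: int) -> int:
    """Training batches spanned by `patience` consecutive validation checks.

    Assumes exactly one check every `val_check_interval` training batches.
    """
    if patience < 0 or val_check_interval <= 0:
        raise ValueError("patience must be >= 0 and val_check_interval > 0")
    return patience * val_check_interval


def patience_for_batches(batches_without_improvement: int, val_check_interval: int) -> int:
    """Smallest patience (in checks) that tolerates the given number of
    training batches without improvement, under the same assumption."""
    if batches_without_improvement < 0 or val_check_interval <= 0:
        raise ValueError("batches must be >= 0 and val_check_interval > 0")
    return math.ceil(batches_without_improvement / val_check_interval)


if __name__ == "__main__":
    # Settings from the report: patience=4, val_check_interval=1000.
    print(batches_covered(4, 1000))
    # Hypothetical target: tolerate 20000 batches without improvement.
    print(patience_for_batches(20000, 1000))
```

Under that assumption, `patience=4` with `val_check_interval=1000` spans roughly 4000 training batches, which can be less than one epoch when an epoch is longer than that.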

Error messages and logs

--------------------------------------------------------------------------
32.4 K    Trainable params
0         Non-trainable params
32.4 K    Total params
0.065     Total estimated model params size (MB)
Epoch 0: : 2586it [1:37:11,  2.25s/it, loss=0.00187, v_num=iza3]
Metric validation/auroc improved. New best score: 0.703
Epoch 0: : 4172it [2:34:50,  2.23s/it, loss=0.000945, v_num=iza3]
Epoch 0: : 5758it [3:28:52,  2.18s/it, loss=0.000335, v_num=iza3]
Metric validation/auroc improved by 0.093 >= min_delta = 0.001. New best score: 0.796
Epoch 0: : 6655it [3:54:41,  2.12s/it, loss=0.000142, v_num=iza3]Adjusting learning rate of group 0 to 7.6846e-03.
Epoch 1: : 1000it [42:46,  2.57s/it, loss=0.000146, v_num=iza3]
Epoch 1: : 2586it [1:34:59,  2.20s/it, loss=0.000154, v_num=iza3]
Epoch 1: : 4172it [2:32:29,  2.19s/it, loss=0.00516, v_num=iza3]
Metric validation/auroc improved by 0.008 >= min_delta = 0.001. New best score: 0.804
Epoch 1: : 5758it [3:26:41,  2.15s/it, loss=0.00103, v_num=iza3]
Epoch 1: : 6667it [3:53:04,  2.10s/it, loss=8.49e-05, v_num=iza3]Adjusting learning rate of group 0 to 7.6619e-03.
Metric validation/auroc improved by 0.018 >= min_delta = 0.001. New best score: 0.821
Epoch 2: : 1000it [42:43,  2.56s/it, loss=0.000118, v_num=iza3]
Epoch 2: : 2586it [1:34:46,  2.20s/it, loss=0.00014, v_num=iza3]
Metric validation/auroc improved by 0.013 >= min_delta = 0.001. New best score: 0.834
Epoch 2: : 4172it [2:31:20,  2.18s/it, loss=0.00582, v_num=iza3]
Epoch 2: : 5758it [3:26:05,  2.15s/it, loss=0.000351, v_num=iza3]
Epoch 2: : 6679it [3:52:53,  2.09s/it, loss=0.00089, v_num=iza3]Adjusting learning rate of group 0 to 7.6241e-03.
Epoch 3: : 1000it [42:35,  2.56s/it, loss=0.000104, v_num=iza3]
[2023-06-01 09:43:58,613] {study_utils.py:271} INFO - <mem='11.37GB/62.53GB'|pmem='20.72GB'> After fit
Monitored metric validation/auroc did not improve in the last 4 records. Best score: 0.834. Signaling Trainer to stop.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
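
To illustrate the kind of counting that could produce the final message in a log like the one above, here is a minimal simulation of a patience counter in `mode="max"`. It is a simplified sketch, not Lightning's implementation, and the scores are made-up values, not the ones from this run.

```python
from typing import Iterable, Optional


def check_that_triggers_stop(
    scores: Iterable[float], patience: int, min_delta: float
) -> Optional[int]:
    """Return the 1-based index of the validation check at which a
    patience-style early stopper (maximizing the metric) would stop,
    or None if it never stops on the given scores."""
    best = float("-inf")
    wait = 0  # consecutive checks without a sufficient improvement
    for index, score in enumerate(scores, start=1):
        if score - min_delta > best:
            # Improvement larger than min_delta: record it and reset the counter.
            best = score
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return index
    return None


if __name__ == "__main__":
    # Hypothetical scores, one per validation check.
    hypothetical_scores = [0.70, 0.80, 0.834, 0.83, 0.834, 0.82, 0.8345]
    print(check_that_triggers_stop(hypothetical_scores, patience=4, min_delta=1e-3))
```

With these hypothetical scores the counter resets at the third check and then sees four checks in a row that do not beat the best score by more than `min_delta`, so the stop is signaled at the seventh check regardless of how many epochs have elapsed.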

Environment

Current environment
  • CUDA:
    • GPU:
      • NVIDIA GeForce RTX 3080
    • available: True
    • version: 11.3
  • Lightning:
    • lightning-utilities: 0.8.0
    • pytorch-lightning: 1.9.5
    • pytorch-resample: 0.1.0
    • torch: 1.12.1+cu113
    • torch-tb-profiler: 0.4.1
    • torchaudio: 0.12.1+cu113
    • torchmetrics: 0.11.4
    • torchsampler: 0.1.2
    • torchvision: 0.13.1+cu113
  • Packages:
    • absl-py: 1.4.0
    • aiohttp: 3.8.4
    • aiosignal: 1.3.1
    • alabaster: 0.7.13
    • alembic: 1.11.1
    • appdirs: 1.4.4
    • argh: 0.28.1
    • arrow: 1.2.3
    • asciitree: 0.3.3
    • asttokens: 2.2.1
    • async-timeout: 4.0.2
    • attrs: 23.1.0
    • autopep8: 2.0.2
    • babel: 2.12.1
    • backcall: 0.2.0
    • binaryornot: 0.4.4
    • black: 23.3.0
    • bokeh: 2.4.3
    • braceexpand: 0.1.7
    • brotli: 1.0.9
    • build: 0.10.0
    • cachetools: 5.3.1
    • certifi: 2022.12.7
    • cffi: 1.15.1
    • cfgv: 3.3.1
    • cftime: 1.6.2
    • chardet: 5.1.0
    • charset-normalizer: 3.1.0
    • click: 8.1.3
    • cloudpickle: 2.2.1
    • cmaes: 0.9.1
    • colorama: 0.4.6
    • colorlog: 6.7.0
    • comm: 0.1.3
    • contourpy: 1.0.7
    • cookiecutter: 2.1.1
    • coverage: 7.2.6
    • cruft: 2.15.0
    • cssselect2: 0.7.0
    • cycler: 0.11.0
    • darglint: 1.8.1
    • dask: 2023.3.2
    • debugpy: 1.6.7
    • decorator: 5.1.1
    • distlib: 0.3.6
    • distributed: 2023.3.2
    • docker-pycreds: 0.4.0
    • docutils: 0.18.1
    • entrypoints: 0.4
    • exceptiongroup: 1.1.1
    • execnet: 1.9.0
    • executing: 1.2.0
    • fastcore: 1.5.29
    • fasteners: 0.18
    • filelock: 3.12.0
    • flake8: 6.0.0
    • flake8-annotations: 3.0.1
    • flake8-black: 0.3.6
    • flake8-bugbear: 23.5.9
    • flake8-docstrings: 1.7.0
    • fonttools: 4.39.2
    • frozenlist: 1.3.3
    • fsspec: 2023.3.0
    • gitdb: 4.0.10
    • gitpython: 3.1.31
    • google-auth: 2.19.0
    • google-auth-oauthlib: 0.4.6
    • greenlet: 2.0.2
    • grpcio: 1.54.2
    • heapdict: 1.0.1
    • html5lib: 1.1
    • identify: 2.5.24
    • idna: 3.4
    • imagesize: 1.4.1
    • importlib-metadata: 6.1.0
    • importlib-resources: 5.12.0
    • iniconfig: 2.0.0
    • ipykernel: 6.23.1
    • ipython: 8.12.2
    • isort: 5.12.0
    • jedi: 0.18.2
    • jinja2: 3.1.2
    • jinja2-time: 0.2.0
    • joblib: 1.2.0
    • jupyter-client: 8.2.0
    • jupyter-core: 5.3.0
    • kiwisolver: 1.4.4
    • lightgbm: 3.3.5
    • lightning-utilities: 0.8.0
    • locket: 1.0.0
    • lovely-numpy: 0.2.9
    • lovely-tensors: 0.1.15
    • mako: 1.2.4
    • markdown: 3.4.3
    • markupsafe: 2.1.2
    • matplotlib: 3.7.1
    • matplotlib-inline: 0.1.6
    • mccabe: 0.7.0
    • msgpack: 1.0.5
    • multidict: 6.0.4
    • mypy: 1.3.0
    • mypy-extensions: 1.0.0
    • nest-asyncio: 1.5.6
    • netcdf4: 1.6.3
    • nodeenv: 1.8.0
    • numcodecs: 0.11.0
    • numpy: 1.20.3
    • oauthlib: 3.2.2
    • optuna: 3.1.1
    • packaging: 23.0
    • pandas: 1.3.5
    • parso: 0.8.3
    • partd: 1.3.0
    • pathspec: 0.11.1
    • pathtools: 0.1.2
    • pep8-naming: 0.13.3
    • pexpect: 4.8.0
    • pickleshare: 0.7.5
    • pillow: 9.4.0
    • pip: 23.0.1
    • platformdirs: 3.5.1
    • pluggy: 1.0.0
    • pre-commit: 3.3.2
    • prompt-toolkit: 3.0.38
    • protobuf: 3.20.3
    • psutil: 5.9.4
    • ptyprocess: 0.7.0
    • pure-eval: 0.2.2
    • py: 1.11.0
    • pyarrow: 11.0.0
    • pyasn1: 0.5.0
    • pyasn1-modules: 0.3.0
    • pycodestyle: 2.10.0
    • pycparser: 2.21
    • pydantic: 1.10.7
    • pydocstyle: 6.3.0
    • pydyf: 0.5.0
    • pyflakes: 3.0.1
    • pygments: 2.15.1
    • pyparsing: 3.0.9
    • pyphen: 0.14.0
    • pyproject-api: 1.5.1
    • pyproject-hooks: 1.0.0
    • pytest: 7.3.1
    • pytest-cov: 4.1.0
    • pytest-forked: 1.6.0
    • pytest-xdist: 3.3.1
    • python-dateutil: 2.8.2
    • python-slugify: 8.0.1
    • pytorch-lightning: 1.9.5
    • pytorch-resample: 0.1.0
    • pytz: 2023.2
    • pyyaml: 6.0
    • pyzmq: 25.1.0
    • requests: 2.28.2
    • requests-oauthlib: 1.3.1
    • rsa: 4.9
    • scikit-learn: 1.2.2
    • scipy: 1.8.1
    • sentry-sdk: 1.24.0
    • setproctitle: 1.3.2
    • setuptools: 67.7.2
    • six: 1.16.0
    • smmap: 5.0.0
    • snowballstemmer: 2.2.0
    • sortedcontainers: 2.4.0
    • sphinx: 6.2.1
    • sphinx-autodoc-typehints: 1.23.0
    • sphinx-rtd-theme: 1.2.1
    • sphinxcontrib-applehelp: 1.0.4
    • sphinxcontrib-devhelp: 1.0.2
    • sphinxcontrib-htmlhelp: 2.0.1
    • sphinxcontrib-jquery: 4.1
    • sphinxcontrib-jsmath: 1.0.1
    • sphinxcontrib-qthelp: 1.0.3
    • sphinxcontrib-serializinghtml: 1.1.5
    • sqlalchemy: 2.0.15
    • stack-data: 0.6.2
    • tblib: 1.7.0
    • tensorboard: 2.11.2
    • tensorboard-data-server: 0.6.1
    • tensorboard-plugin-wit: 1.8.1
    • text-unidecode: 1.3
    • threadpoolctl: 3.1.0
    • tinycss2: 1.2.1
    • tomli: 2.0.1
    • toolz: 0.12.0
    • torch: 1.12.1+cu113
    • torch-tb-profiler: 0.4.1
    • torchaudio: 0.12.1+cu113
    • torchmetrics: 0.11.4
    • torchsampler: 0.1.2
    • torchvision: 0.13.1+cu113
    • tornado: 6.2
    • tox: 4.5.2
    • tqdm: 4.65.0
    • traitlets: 5.9.0
    • typer: 0.9.0
    • types-docutils: 0.20.0.1
    • types-pyyaml: 6.0.12.10
    • types-requests: 2.28.11.17
    • types-setuptools: 67.3.0.2
    • types-urllib3: 1.26.25.13
    • typing-extensions: 4.5.0
    • urllib3: 1.26.15
    • versioneer: 0.22
    • virtualenv: 20.23.0
    • wandb: 0.14.2
    • watchdog: 3.0.0
    • wcwidth: 0.2.6
    • weasyprint: 58.1
    • webdataset: 0.2.48
    • webencodings: 0.5.1
    • werkzeug: 2.3.4
    • wheel: 0.40.0
    • xarray: 2023.1.0
    • xbatcher: 0.3.0
    • yarl: 1.9.2
    • zarr: 2.14.2
    • zict: 2.2.0
    • zipp: 3.15.0
    • zopfli: 0.2.2
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor: x86_64
    • python: 3.8.10
    • release: 5.8.0-48-generic
    • version: #54~20.04.1-Ubuntu SMP Sat Mar 20 13:40:25 UTC 2021

More info

(Might be related to #490.)

noamsgl added the bug and needs triage labels Jun 1, 2023

noamsgl commented Jun 4, 2023

Fixed.

noamsgl closed this as not planned Jun 4, 2023