Early stopping conditioned on metric `val_loss` isn't recognised when setting the `val_check_interval` #490

Comments
can you post your test_end step?

I didn't use a test set since it is optional. The default MNIST example in the README will reproduce the behaviour when changing the trainer line to:

sorry, meant validation_end
```python
def validation_end(self, outputs):
    # OPTIONAL
    avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
    tensorboard_logs = {'val_loss': avg_loss}
    return {'avg_val_loss': avg_loss, 'log': tensorboard_logs}
```

I tried changing 'avg_val_loss' -> 'val_loss' but the same issue occurs.
it should be `val_loss`
I tried it with `val_loss`. The issue still occurs; it only doesn't happen when using the default `val_check_interval`.
ok got it. can you share the stacktrace?
There is no error, just a warning at the end of epoch 3, and then training stops.
It looks like the problem is that there is only one `callback_metrics` dict shared across training and validation. If that is true, we can just force validation computation at the end of the training epoch.
@kuynzereb we shouldn't force computation. Just partition `self.callback_metrics` to have:

```python
self.callback_metrics['train'] = {}
self.callback_metrics['val'] = {}
self.callback_metrics['test'] = {}
```

Anyone interested in the PR?
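A minimal standalone sketch of that partitioning idea (plain Python with illustrative names, not the actual Lightning internals): the early stopping check reads only the `val` partition, so training-only keys can no longer hide the absence of the validation metric.

```python
# Illustrative sketch: partition callback metrics by stage so that
# the early stopping check only ever looks at validation metrics.
callback_metrics = {'train': {}, 'val': {}, 'test': {}}

def log_metric(stage: str, name: str, value: float) -> None:
    """Record a metric under its stage ('train', 'val' or 'test')."""
    callback_metrics[stage][name] = value

def early_stop_should_check(monitor: str = 'val_loss') -> bool:
    """Only evaluate early stopping once the monitored val metric exists."""
    return monitor in callback_metrics['val']

log_metric('train', 'loss', 0.41)
print(early_stop_should_check())    # False: no validation metric yet
log_metric('val', 'val_loss', 0.37)
print(early_stop_should_check())    # True: the monitored metric is available
```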
I created a PR #492 but made a simple change to update `self.callback_metrics` instead, as then it won't require changes to the
@williamFalcon @ryanwongsa
FYI, I was still having this issue, which I traced to not having a large enough `trainer.overfit_pc` relative to my batch size and number of GPUs. The validation sanity checks and `validation_end` seemed to get skipped (if I ran without early stopping), so my loss metrics dict was never returned. Solved purely by increasing `overfit_pc`.
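As a rough illustration of that failure mode (the numbers and the batch-count formula here are made up for the example; Lightning's internal calculation may differ):

```python
# Illustrative arithmetic: a small overfit fraction can leave zero
# validation batches once it is split across devices.
dataset_size = 5_000
overfit_pc = 0.01            # keep 1% of the data
batch_size = 64
num_gpus = 2

samples_kept = int(dataset_size * overfit_pc)               # 50 samples
batches_per_gpu = samples_kept // (batch_size * num_gpus)   # 50 // 128 == 0

print(samples_kept, batches_per_gpu)
# With zero validation batches, validation_end never runs,
# so 'val_loss' never reaches the early stopping callback.
```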
I encountered the same problem when I set `check_val_every_n_epoch > 1`. The early stopping callback checks the monitored metric with:

```python
def check_metrics(self, logs):
    monitor_val = logs.get(self.monitor)
    error_msg = (f'Early stopping conditioned on metric `{self.monitor}`'
                 f' which is not available. Available metrics are:'
                 f' `{"`, `".join(list(logs.keys()))}`')

    if monitor_val is None:
        if self.strict:
            raise RuntimeError(error_msg)
        if self.verbose > 0:
            rank_zero_warn(error_msg, RuntimeWarning)
        return False

    return True
```

And if the monitored metric is missing from `logs`, this either raises a `RuntimeError` (with `strict=True`) or only warns. I think the problem is that this check runs at the end of every training epoch, while the validation metrics only exist for the epochs on which validation actually ran. Instead, a good solution is to make the early stopping check run when validation ends.
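To see how `strict` and `verbose` behave when the monitored key is missing, here is a self-contained toy version of the same check (plain Python, with `warnings.warn` standing in for Lightning's `rank_zero_warn`):

```python
import warnings

def check_metrics(logs: dict, monitor: str = 'val_loss',
                  strict: bool = True, verbose: int = 1) -> bool:
    """Toy re-implementation of the check above, outside of Lightning."""
    error_msg = (f'Early stopping conditioned on metric `{monitor}`'
                 f' which is not available. Available metrics are:'
                 f' `{"`, `".join(list(logs.keys()))}`')
    if logs.get(monitor) is None:
        if strict:
            raise RuntimeError(error_msg)
        if verbose > 0:
            warnings.warn(error_msg, RuntimeWarning)
        return False
    return True

print(check_metrics({'val_loss': 0.31}))                  # True
print(check_metrics({'train_loss': 0.42}, strict=False))  # warns and returns False
# check_metrics({'train_loss': 0.42})                     # strict=True would raise RuntimeError
```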
`EarlyStopping` should check the metric of interest `on_validation_end` rather than `on_epoch_end`. In a normal scenario, this does not cause a problem, but in combination with `check_val_every_n_epoch>1` in the `Trainer` it results in a warning or in a `RuntimeError` depending on `strict`.
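A rough sketch of that idea as a standalone callback. This is not the code from the actual fix; it assumes the `Callback.on_validation_end(trainer, pl_module)` hook and the `trainer.callback_metrics` / `trainer.should_stop` attributes found in later Lightning versions:

```python
import math

from pytorch_lightning.callbacks import Callback


class ValEndEarlyStopping(Callback):
    """Illustrative early stopping that only checks when validation actually ran."""

    def __init__(self, monitor='val_loss', patience=3, min_delta=0.0):
        self.monitor = monitor
        self.patience = patience
        self.min_delta = min_delta
        self.best = math.inf
        self.wait = 0

    def on_validation_end(self, trainer, pl_module):
        current = trainer.callback_metrics.get(self.monitor)
        if current is None:
            # No such metric was produced by this validation run; stay silent
            # instead of warning on every training epoch.
            return
        current = float(current)
        if current < self.best - self.min_delta:
            self.best = current
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                trainer.should_stop = True  # ask the Trainer to stop training
```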
* Fixes #490: `EarlyStopping` should check the metric of interest `on_validation_end` rather than `on_epoch_end`.
* Highlighted that the ES callback runs on val epochs in the docstring
* Updated EarlyStopping in the rst doc
* Update early_stopping.py
* Update early_stopping.rst (×4)
* Apply suggestions from code review
* Update docs/source/early_stopping.rst
* Fix doctest indentation warning
* Train loop calls early_stop.on_validation_end
* chlog

Co-authored-by: Adrian Wälchli <[email protected]>, William Falcon <[email protected]>, Jirka Borovec <[email protected]>, Jirka <[email protected]>
I had the same problem. I did a dummy sampling (e.g. `X_test = X_test[10000:11000]`) for the validation set with indices that were too high; the set wasn't that long, so this produced a completely empty set, and of course the NN was not able to validate on nothing. Maybe a warning message would be good if the test/train sets are empty.
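For reference, a tiny standalone example of how an out-of-range slice silently produces an empty validation set, together with the kind of guard that could warn about it (purely illustrative, not Lightning code):

```python
import warnings

X = list(range(8_000))       # pretend dataset with 8,000 samples
X_val = X[10_000:11_000]     # slice starts past the end -> silently empty

if len(X_val) == 0:
    warnings.warn('Validation set is empty; val_loss will never be produced, '
                  'so early stopping has nothing to monitor.', RuntimeWarning)

print(len(X_val))            # 0
```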
Describe the bug

Training stops when setting `val_check_interval` < 1.0 in the Trainer class, as it doesn't recognise `val_loss`. I get the following warning at the end of the 3rd epoch:

To Reproduce

Steps to reproduce the behavior: change the `trainer` line to `trainer = Trainer(val_check_interval=0.5, default_save_path="test")`

Expected behavior

Training shouldn't stop and `val_loss` should be recognised.

Desktop (please complete the following information):

Additional context

This doesn't happen with 0.5.2.1, although it looks like something has changed with the model saving mechanism, since it only seems to save the best model in 0.5.3.2.

EDIT: Also seems to happen when setting `train_percent_check` < 1.0