Collect all the training_step outputs for training epoch end #2354
Conversation
@@ -690,7 +694,7 @@ def run_training_batch(self, batch, batch_idx):
        signal=0,
        grad_norm_dic=grad_norm_dic,
        batch_log_metrics=batch_log_metrics,
-       training_step_output_for_epoch_end=opt_closure_result.training_step_output_for_epoch_end
+       training_step_output_for_epoch_end=all_training_step_output_for_epoch_end
I believe the function that calls this method also needs to be fixed to take the list of outputs into account.
training_step_output_for_epoch_end=opt_closure_result.training_step_output_for_epoch_end
This line only captures the outputs of the last iteration and discards those of all previous iterations.
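To make the bug concrete, here is a minimal, self-contained sketch of the two patterns being discussed (the names `run_training_batch` and the `{"loss": ...}` output dict are illustrative stand-ins, not Lightning's actual API): overwriting a single variable in the loop keeps only the last iteration's output, while appending to a list keeps them all.

```python
def run_training_batch(split_losses):
    """Toy loop over tbptt splits / optimizer iterations.

    Returns both the buggy result (only the last split's output)
    and the fixed result (every split's output).
    """
    all_outputs = []   # fixed pattern: accumulate every iteration's output
    last_output = None
    for loss in split_losses:
        # stand-in for opt_closure_result.training_step_output_for_epoch_end
        output = {"loss": loss}
        last_output = output        # buggy pattern: each iteration overwrites the previous one
        all_outputs.append(output)  # fixed pattern: nothing is discarded
    return last_output, all_outputs


last, everything = run_training_batch([1, 2, 3])
# last       -> {"loss": 3}          (previous iterations lost)
# everything -> [{"loss": 1}, {"loss": 2}, {"loss": 3}]
```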
@@ -690,7 +694,7 @@ def run_training_batch(self, batch, batch_idx):
        signal=0,
        grad_norm_dic=grad_norm_dic,
        batch_log_metrics=batch_log_metrics,
-       training_step_output_for_epoch_end=opt_closure_result.training_step_output_for_epoch_end
+       training_step_output_for_epoch_end=training_step_output_for_epoch_end_list
This is now a list of outputs, so we need to make sure the method that called this processes it accordingly.
sounds good 😄
wait... I was thinking about this. I still think this PR is not correct. We don't want the output from EVERY tbptt split... only the last one, which is what this code does today.
In that case, we will miss some outputs, which will lead to a wrong metric calculation at the end of the epoch, won't it?
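The metric concern above can be sketched numerically (the values and the mean-over-`"loss"` metric are made up for illustration): an epoch-end average computed from only the last tbptt split's output can differ from one computed over every split's output.

```python
def epoch_mean(outputs):
    """Average the 'loss' value over a list of step outputs."""
    return sum(o["loss"] for o in outputs) / len(outputs)


# Hypothetical outputs from three tbptt splits of one batch.
splits = [{"loss": 1.0}, {"loss": 2.0}, {"loss": 6.0}]

mean_all = epoch_mean(splits)        # over every split  -> 3.0
mean_last = epoch_mean(splits[-1:])  # last split only   -> 6.0
```

Which of the two is the intended semantics is exactly what this thread is debating; the sketch only shows that the results diverge.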
How is it going here?
I couldn't figure out the failed-tests issue. However, I am using this PR's changes on my local machine and it is working perfectly.
This pull request is now in conflict... :(
@Borda and @williamFalcon, did you get a chance to look into this issue? I was using this PR's version and it is working properly on my end. I didn't find the cause of the conflict.
Is the issue still present, now that #2320 is closed?
Finished in #2890 |
Fixes #2320
Bug fixes for #2320 after the #2328 commits