slow tpu train #2033
Conversation
Working fine on a single tpu_core and on a specific tpu_core, but now getting an error.
Once this is merged: let's get rid of .spawn and replace it with the same method.
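(For context, the .spawn referred to here is torch_xla's multiprocessing launcher. A minimal sketch of that multi-core path, where train_fn and its flags argument are hypothetical placeholders, not code from this PR:)

```python
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def train_fn(index, flags):
    # Each spawned process is pinned to one TPU core and gets its own XLA device.
    device = xm.xla_device()
    print(f"process {index} running on {device}")

if __name__ == "__main__":
    # Launch one process per TPU core; this is the multi-core entry point the
    # comment above suggests unifying with the single-core device selection.
    xmp.spawn(train_fn, args=({},), nprocs=8)
```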
@rohitgr7 This usually happens when you have already selected a core, i.e. ...
@lezwon mind adding a note to the changelog?
-        if self.use_tpu and self.tpu_id is None:
-            device = xm.xla_device()
+        if self.use_tpu:
+            device = xm.xla_device(self.tpu_id) if self.tpu_id is not None else xm.xla_device()
According to the docs, the None case does not need to be handled explicitly:
https://pytorch.org/xla/release/1.5/index.html#torch_xla.core.xla_model.xla_device
device = xm.xla_device(self.tpu_id)
is fine even if self.tpu_id is None.
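(To make the fallback concrete, a minimal sketch of the behaviour described in the linked docs; the tpu_id values below are placeholders, not the Trainer's attribute:)

```python
import torch_xla.core.xla_model as xm

# No specific core requested: passing None falls back to the default XLA
# device for this process, so an explicit None-check is unnecessary.
tpu_id = None
device = xm.xla_device(tpu_id)   # equivalent to xm.xla_device()

# A specific core requested: the ordinal is forwarded to the runtime.
device = xm.xla_device(1)        # pin to the TPU core with ordinal 1
```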
@awaelchli Yea, it does work without it too. We could refactor these instances across the project.
I have refactored it here #1756 so if you do it here too, then I think it's pretty much all.
Does it also work for older versions... =)
@Borda Do you mean older versions of Lightning? I've noticed xm.xla_device(None)
working earlier too. The docs @awaelchli pointed out, though, confirm that it should work well.
I was thinking of older versions of XLA, but there is no reason why it should not :]
This pull request is now in conflict... :(
Force-pushed from ce7f0b9 to c09f2ec
This reverts commit 1fb6e58
Codecov Report
@@           Coverage Diff           @@
##           master   #2033   +/-   ##
=======================================
  Coverage      88%     88%
=======================================
  Files          74      74
  Lines        4666    4666
=======================================
  Hits         4084    4084
  Misses        582     582
@lezwon I ran directly with 8 cores and that error occurred. I see you have made some changes in the PR. I will try again with the updated version tonight and let you know.
Getting this error now in all the cases, even when ... is set.
The checkpoints trained with an old version (before the hparams PR) are not compatible with the latest change to how hparams are stored in checkpoints. This is a pending issue that needs to be resolved.
I retrained the model from scratch. Even when setting ...
Hi @rohitgr7 @lezwon @awaelchli
Is this issue solved yet?
@Borda FYI we're working on fixing the hparams thing.
Hi @Borda @williamFalcon |
* use parallel loader
* Revert "use parallel loader" (This reverts commit ed6e758)
* select tpu id for pl
* condition if tpu_id is None
* added info to changelog
* Revert "condition if tpu_id is None" (This reverts commit 1fb6e58)
* Apply suggestions from code review

Co-authored-by: Jirka Borovec <[email protected]>
Before submitting
What does this PR do?
Fixes #2016.
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃