[Fix] text-classification PL example #6027
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #6027      +/-   ##
==========================================
+ Coverage   77.46%   78.50%   +1.04%
==========================================
  Files         146      146
  Lines       26243    26243
==========================================
+ Hits        20330    20603     +273
+ Misses       5913     5640     -273
Continue to review full report at Codecov.
Why wasn't any of this breaking tests? Is there some test coverage we could add to find these sorts of errors earlier?
@@ -23,7 +23,7 @@ mkdir -p $OUTPUT_DIR
 # Add parent directory to python path to access lightning_base.py
 export PYTHONPATH="../":"${PYTHONPATH}"

-python3 run_pl_glue.py --data_dir $DATA_DIR \
+python3 run_pl_glue.py --gpus 2 --data_dir $DATA_DIR \
Should 2 GPUs be the default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should not.
So should I add environment variables in the shell script just like $DATA_DIR etc.?
DATA_DIR is solid. My only objection is gpus=2.
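For reference, a minimal sketch of how the number of GPUs could be exposed as a configurable argument in lightning_base.py rather than hard-coded in the shell script. The add_generic_args name and the default of 0 are assumptions for illustration, not necessarily the merged code:

```python
import argparse


def add_generic_args(parser: argparse.ArgumentParser, root_dir: str) -> None:
    # Hypothetical excerpt: expose --gpus as a generic example argument so the
    # shell script can pass it explicitly instead of baking in a value of 2.
    parser.add_argument(
        "--gpus",
        default=0,
        type=int,
        help="Number of GPUs to train on (0 means CPU only)",
    )
```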
@@ -72,7 +76,7 @@ def prepare_data(self):
             logger.info("Saving features into cached file %s", cached_features_file)
             torch.save(features, cached_features_file)

-    def load_dataset(self, mode, batch_size):
+    def get_dataloader(self, mode, batch_size):
get_dataloader has one more arg: shuffle.
Yes, I think wrapping this load_dataset() into a get_dataloader() is the best way to handle this.
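A minimal sketch of what such a wrapper could look like, assuming a load_dataset() helper that returns a TensorDataset. The class name and method bodies here are illustrative placeholders, not the merged code:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


class GLUEDataExample:
    """Illustrative placeholder for the example's data handling."""

    def load_dataset(self, mode: str) -> TensorDataset:
        # Stand-in for the real logic that loads cached features for `mode`.
        input_ids = torch.randint(0, 100, (8, 16))
        labels = torch.randint(0, 2, (8,))
        return TensorDataset(input_ids, labels)

    def get_dataloader(self, mode: str, batch_size: int, shuffle: bool = False) -> DataLoader:
        # Wrap load_dataset() and expose the extra `shuffle` argument that the
        # training loop expects when building the train dataloader.
        return DataLoader(self.load_dataset(mode), batch_size=batch_size, shuffle=shuffle)
```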
Besides, looking at the output of the examples test, it's impossible to tell which examples are being run and which aren't; it only indicates the name on failure. Perhaps at the very least add a pytest option so that it announces which tests were run? Submitted a PR to do just that: #6035
Here is a PR that adds the missing PL glue test: #6034 (which obviously fails on CI - a good thing).
@stas00 you can at least see all the files that are run with:
you want one dir up as well, so:
but it tells only part of the story, since most info is hidden in
Excited for this!
Co-authored-by: Sam Shleifer <[email protected]>
How does this break the TF tests? Looks like the model save still has issues with the state_dict it's saving.
TF failures are spurious.
Merging this, thanks @bhashithe, @stas00, @laibamehnaz and everyone else who helped!
The merged #6027 broke
yeah, PL already has
Let's continue the discussion here: #6310
The text-classification example needed several edits to get it working. The main one was that the hparams are loaded from the checkpoint as a dict instead of a Namespace object, so this needed to be fixed by recasting the hparams to a Namespace object. Though this is not the ideal solution, it works for now.
I also have some other fixes, such as the gpus argument, which needed to be added to the generic arguments list in lightning_base.py, and removing the default value for n_tpu_cores. Also, the lr_scheduler was not accessed correctly by the logging callback. These have all been fixed and the example now works correctly.
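For illustration, a minimal sketch of the hparams recast described above. The helper name and the place it would be called from are assumptions, not the merged code:

```python
import argparse


def ensure_namespace(hparams):
    # When restoring from a checkpoint, the hyperparameters may come back as a
    # plain dict; the example code expects an argparse.Namespace, so recast it.
    if isinstance(hparams, dict):
        return argparse.Namespace(**hparams)
    return hparams


# Hypothetical usage inside the LightningModule's __init__:
# self.hparams = ensure_namespace(hparams)
```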