Create NeptuneHook mechanism for automatic metadata logging #1

Merged · 33 commits into main from aw/add-neptune-hook · Dec 30, 2022

Conversation

AleksanderWWW (Contributor)

No description provided.

@@ -22,7 +22,7 @@ jobs:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]


Just curious, why not Windows?



self.base_handler = self._run[base_namespace]

verify_type("run", self._run, Run)


I think this should be before getting a base_namespace handler?
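For illustration, the suggested order would just swap the two quoted lines (a fragment from the hook's setup code, not standalone code):

    verify_type("run", self._run, Run)             # validate the run first...
    self.base_handler = self._run[base_namespace]  # ...then take the namespace handler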


self._run.sync()
self._run.stop()
self._clean_output_dir()


Is there a way we can do this more eagerly (as soon as a file is uploaded)?

One problem I see: for a very long training session that creates a lot of checkpoints, we are doubling the number of checkpoints saved (and if the user has allocated smaller storage, this can lead to running out of space).

@AleksanderWWW (Contributor, author), Dec 22, 2022


We could move sync and clear to the _log_checkpoint method, so that it performs those actions after each checkpoint creation, but then we might face a considerable performance impact. I guess it depends on what we want to save more: time or space.


Could we do it, say, every Nth time instead?

@AleksanderWWW (Contributor, author), Dec 22, 2022


Sounds feasible. Maybe something like a checkpoint_sync_freq param passed by the user (defaulting to 1)?
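A rough sketch of how that could work; checkpoint_sync_freq is only a name proposed in this thread, not an existing parameter:

    # Hypothetical helper: decide whether to sync the run and clean the local
    # output dir after a checkpoint, instead of doing it only once at the end.
    class CheckpointSyncPolicy:
        def __init__(self, checkpoint_sync_freq: int = 1):
            self.checkpoint_sync_freq = checkpoint_sync_freq
            self._checkpoints_logged = 0

        def should_sync(self, final: bool = False) -> bool:
            # Called once per logged checkpoint; True means "sync the run and
            # clean the output directory now".
            self._checkpoints_logged += 1
            return final or self._checkpoints_logged % self.checkpoint_sync_freq == 0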


I was thinking about this as our internal magic number, a tradeoff between doing it at the end vs. doing it each time. The value would come from the domain knowledge of @kshitij12345 ;) How many checkpoints are we OK with storing before we need to clean up the space?


An important thing to consider is that not all model checkpoints will have the same size.

For example, from the model zoo, the checkpoint of Faster R-CNN with ResNet-50 is 135 MB, while Faster R-CNN with ResNet-101 is 243 MB. So a single magic number most likely would not work.

It would be nice if the upload() method actually returned a handle object that one could query to check whether the upload is done, and make decisions based on that (something like a Future from Python's async paradigm).
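Purely as an illustration of that idea, and not Neptune's actual API (its upload/assignment calls do not return such a handle), the pattern could look like the sketch below, with a local copy standing in for the upload:

    import os
    import shutil
    from concurrent.futures import Future, ThreadPoolExecutor

    def upload_checkpoint(executor: ThreadPoolExecutor, src: str, dst: str) -> Future:
        # Stand-in "upload" (a local copy) that returns a Future the caller can query.
        future = executor.submit(shutil.copyfile, src, dst)
        # Remove the local file only after the transfer has finished successfully.
        future.add_done_callback(lambda f: os.remove(src) if f.exception() is None else None)
        return future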

@AleksanderWWW (Contributor, author)


Shall we have a meeting this week to discuss this matter? @kshitij12345 @Herudaio

@AleksanderWWW (Contributor, author)


Never mind, solved it. Check it out.

@kshitij12345 left a comment


Thanks! Overall looks good, just one main question related to checkpointing.

    return

self.trainer.checkpointer.save(f"neptune_iter_{self.trainer.iter}")
path = "model/checkpoints/checkpoint_{}"


Can we rename this to neptune_model_path or something, to make it clear that this is for Neptune? I was confused reading this.
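For example (illustrative only):

    neptune_model_path = "model/checkpoints/checkpoint_{}"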


self.trainer.checkpointer.save(f"neptune_iter_{self.trainer.iter}")
path = "model/checkpoints/checkpoint_{}"
if final:


Maybe condense the if to

path = path.format("final" if final else f"iter_{self.trainer.iter}")

else:
    path = path.format(f"iter_{self.trainer.iter}")

if self.trainer.checkpointer.has_checkpoint():


Do we need this given that we call save above?


with open(checkpoint_path, "rb") as fp:
    self._run[path] = File.from_stream(fp)
os.remove(checkpoint_path)


I am not sure about this. If we are in async mode then is it possible that the file will be removed while the stream is still being read?

@AleksanderWWW (Contributor, author)


I don't think so. Reading ends when we exit the context manager, since that closes the stream. By the time you hit the os.remove line, the stream has already been copied to the .neptune location, from which it is uploaded asynchronously and then immediately deleted.
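A minimal, self-contained illustration of that claim, using a plain bytes read as a stand-in for File.from_stream (assumed here to consume the stream the same way before the assignment returns):

    import os
    import tempfile

    # Write a throwaway "checkpoint" file.
    with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
        tmp.write(b"checkpoint bytes")
        checkpoint_path = tmp.name

    # The contents are fully read inside the context manager, so removing the
    # file afterwards does not affect the data already held in memory.
    with open(checkpoint_path, "rb") as fp:
        payload = fp.read()
    os.remove(checkpoint_path)
    assert payload == b"checkpoint bytes"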


That makes sense then. Thank you!

@kshitij12345 left a comment


LGTM! Thank you @AleksanderWWW

(Though I would recommend a review from engineering as well :) )

README.md Outdated
@@ -1,4 +1,4 @@
# Neptune - detectron2 integration

TODO: Update docs link


Now you can remove the TODO ;)

pyproject.toml Outdated
@@ -16,11 +16,14 @@ importlib-metadata = { version = "*", python = "<3.8" }

# TODO: Base requirements


This TODO can be removed if the base requirements are satisfied.

img_dir = "./datasets/coco/train2014"
if not os.path.isdir(img_dir) or len(os.listdir(img_dir)) == 0:
    os.makedirs(img_dir, exist_ok=True)
    os.system("wget http://images.cocodataset.org/train2014/COCO_train2014_000000057870.jpg")


Why do we need images stored in the repo if they are downloaded dynamically?



If we don't need them, some gitignore entries might be a good idea.
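For instance, a hypothetical .gitignore entry, assuming the downloaded images stay under the datasets directory used above:

    # Ignore data that the example scripts download on the fly
    datasets/coco/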

@AleksanderWWW AleksanderWWW merged commit 4326b81 into main Dec 30, 2022
@AleksanderWWW AleksanderWWW deleted the aw/add-neptune-hook branch December 30, 2022 16:57