Create NeptuneHook mechanism for automatic metadata logging #1
Conversation
@@ -22,7 +22,7 @@ jobs:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
Just curious, why not Windows?
Per https://detectron2.readthedocs.io/en/latest/tutorials/install.html, detectron2 doesn't support Windows.
self.base_handler = self._run[base_namespace]

verify_type("run", self._run, Run)
I think this should be before getting a base_namespace handler?
self._run.sync()
self._run.stop()
self._clean_output_dir()
Is there a way we can do this more eagerly? (as soon as file is uploaded?)
One problem I see: for a very long training session that creates a lot of checkpoints, we are doubling the number of checkpoints saved (and if the user has allocated smaller storage, this could lead to out-of-space issues).
We could move sync and clear to the _log_checkpoint method, so that those actions run after each checkpoint creation, but then we might face a considerable performance impact. I guess it depends on what we want to save more: time or space.
Could we do it every Nth time instead?
Sounds feasible. Maybe something like a checkpoint_sync_freq param passed by the user (defaulting to 1)?
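A minimal sketch of this idea, purely illustrative: the class and parameter names below (CheckpointSyncPolicy, checkpoint_sync_freq) are hypothetical, not part of the actual NeptuneHook API. The point is just that the hook would count checkpoints and only sync/clean on every Nth one.

```python
# Hypothetical sketch of the checkpoint_sync_freq idea discussed above:
# sync and clean the output dir only every Nth checkpoint instead of
# after every one. Names are illustrative, not the real NeptuneHook API.

class CheckpointSyncPolicy:
    def __init__(self, checkpoint_sync_freq: int = 1):
        if checkpoint_sync_freq < 1:
            raise ValueError("checkpoint_sync_freq must be >= 1")
        self.freq = checkpoint_sync_freq
        self._count = 0

    def should_sync(self) -> bool:
        # Returns True on every Nth recorded checkpoint.
        self._count += 1
        return self._count % self.freq == 0

policy = CheckpointSyncPolicy(checkpoint_sync_freq=3)
results = [policy.should_sync() for _ in range(6)]
# Syncs on the 3rd and 6th checkpoints only.
```

With freq=1 this degenerates to the current behavior (sync after every checkpoint), so the default would be backward compatible.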
I was thinking about this as our internal magic number, a tradeoff between doing it at the end vs. doing it each time. The value would come from the domain knowledge of @kshitij12345 ;) How many checkpoints are we OK with storing before we need to clean the space?
An important thing to consider is that not all model checkpoints have the same size. For example, from the model zoo, the checkpoint of Faster R-CNN with ResNet-50 is 135 MB, while Faster R-CNN with ResNet-101 is 243 MB. So a single magic number most likely wouldn't work.
It would be nice if the upload() method actually returned a handle object that one could query to see whether the upload is done, and make decisions based on that (something like a Future from Python's async paradigm).
Shall we have a meeting this week to discuss this matter? @kshitij12345 @Herudaio
Never mind, solved it. Check it out
Thanks! Overall looks good, just one main question related to checkpointing.
    return

self.trainer.checkpointer.save(f"neptune_iter_{self.trainer.iter}")
path = "model/checkpoints/checkpoint_{}"
Can we rename this to neptune_model_path or something, to make it clear that this is for Neptune? I was confused reading this.
self.trainer.checkpointer.save(f"neptune_iter_{self.trainer.iter}")
path = "model/checkpoints/checkpoint_{}"
if final:
Maybe condense the if/else to:
path = path.format("final" if final else f"iter_{self.trainer.iter}")
else:
    path = path.format(f"iter_{self.trainer.iter}")

if self.trainer.checkpointer.has_checkpoint():
Do we need this, given that we call save above?
with open(checkpoint_path, "rb") as fp:
    self._run[path] = File.from_stream(fp)
os.remove(checkpoint_path)
I am not sure about this. If we are in async mode, is it possible that the file will be removed while the stream is still being read?
I don't think so. Reading ends when we exit the context manager, since that closes the stream. By the time we hit the os.remove line, the stream has already been copied to the .neptune location, from which it is asynchronously uploaded and then immediately deleted.
That makes sense then. Thank you!
LGTM! Thank you @AleksanderWWW
(Though I would recommend a review from engineering as well :) )
README.md
Outdated
@@ -1,4 +1,4 @@
# Neptune - detectron2 integration

TODO: Update docs link
Now you can remove TODO ;)
pyproject.toml
Outdated
@@ -16,11 +16,14 @@ importlib-metadata = { version = "*", python = "<3.8" }

# TODO: Base requirements
This TODO can be removed once the base requirements are satisfied.
img_dir = "./datasets/coco/train2014"
if not os.path.isdir(img_dir) or len(os.listdir(img_dir)) == 0:
    os.makedirs(img_dir, exist_ok=True)
    os.system("wget http://images.cocodataset.org/train2014/COCO_train2014_000000057870.jpg")
Why do we need images stored in the repo if they are downloaded dynamically?
If we don't need them, some .gitignore entries might be a good idea.
No description provided.