Create NeptuneHook mechanism for automatic metadata logging #1

Merged · 33 commits into main from aw/add-neptune-hook · Dec 30, 2022

Conversation

AleksanderWWW (Contributor)

No description provided.

@@ -22,7 +22,7 @@ jobs:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]


Just curious, why not Windows?



self.base_handler = self._run[base_namespace]

verify_type("run", self._run, Run)


I think this should be before getting a base_namespace handler?
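For illustration, the suggested order would just swap the two quoted lines (a fragment from the hook's setup code, not standalone code):

    verify_type("run", self._run, Run)             # validate the run first...
    self.base_handler = self._run[base_namespace]  # ...then take the namespace handler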


self._run.sync()
self._run.stop()
self._clean_output_dir()


Is there a way we can do this more eagerly (as soon as a file is uploaded)?

One problem I see: for a very long training session that creates a lot of checkpoints, we are doubling the number of checkpoints saved (and if the user has allocated smaller storage, this can lead to running out of space).

@AleksanderWWW (Contributor, author), Dec 22, 2022


We could move sync and clear to the _log_checkpoint method, so that it performs those actions after each checkpoint creation, but then we might face a considerable performance impact. I guess it depends on what we want to save more: time or space.


Could we do it, say, every Nth time instead?

@AleksanderWWW (Contributor, author), Dec 22, 2022


Sounds feasible. Maybe something like a checkpoint_sync_freq param passed by the user (defaulting to 1)?
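A rough sketch of how that could work; checkpoint_sync_freq is only a name proposed in this thread, not an existing parameter:

    # Hypothetical helper: decide whether to sync the run and clean the local
    # output dir after a checkpoint, instead of doing it only once at the end.
    class CheckpointSyncPolicy:
        def __init__(self, checkpoint_sync_freq: int = 1):
            self.checkpoint_sync_freq = checkpoint_sync_freq
            self._checkpoints_logged = 0

        def should_sync(self, final: bool = False) -> bool:
            # Called once per logged checkpoint; True means "sync the run and
            # clean the output directory now".
            self._checkpoints_logged += 1
            return final or self._checkpoints_logged % self.checkpoint_sync_freq == 0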


I was thinking about this as our internal magic number, a tradeoff between doing it at the end vs. doing it each time. The value would come from the domain knowledge of @kshitij12345 ;) How many checkpoints are we OK with storing before we need to clean up the space?


An important thing to consider is that not all model checkpoints will have the same size.

For example, from the model zoo, the checkpoint of Faster R-CNN with ResNet-50 is 135 MB, while Faster R-CNN with ResNet-101 is 243 MB. So a single magic number most likely would not work.

It would be nice if the upload() method actually returned a handle object that one could query to check whether the upload is done, and make decisions based on that (something like a Future from Python's async paradigm).
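Purely as an illustration of that idea, and not Neptune's actual API (its upload/assignment calls do not return such a handle), the pattern could look like the sketch below, with a local copy standing in for the upload:

    import os
    import shutil
    from concurrent.futures import Future, ThreadPoolExecutor

    def upload_checkpoint(executor: ThreadPoolExecutor, src: str, dst: str) -> Future:
        # Stand-in "upload" (a local copy) that returns a Future the caller can query.
        future = executor.submit(shutil.copyfile, src, dst)
        # Remove the local file only after the transfer has finished successfully.
        future.add_done_callback(lambda f: os.remove(src) if f.exception() is None else None)
        return future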

@AleksanderWWW (Contributor, author)


Shall we have a meeting this week to discuss this matter? @kshitij12345 @Herudaio

@AleksanderWWW (Contributor, author)


Never mind, solved it. Check it out.

@kshitij12345 left a comment


Thanks! Overall looks good, just one main question related to checkpointing.

    return

self.trainer.checkpointer.save(f"neptune_iter_{self.trainer.iter}")
path = "model/checkpoints/checkpoint_{}"


Can we rename this to neptune_model_path or something, to make it clear that this is for Neptune? I was confused reading this.
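For example (illustrative only):

    neptune_model_path = "model/checkpoints/checkpoint_{}"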


self.trainer.checkpointer.save(f"neptune_iter_{self.trainer.iter}")
path = "model/checkpoints/checkpoint_{}"
if final:


Maybe condense the if to

path = path.format("final" if final else f"iter_{self.trainer.iter}")

else:
    path = path.format(f"iter_{self.trainer.iter}")

if self.trainer.checkpointer.has_checkpoint():


Do we need this given that we call save above?


with open(checkpoint_path, "rb") as fp:
    self._run[path] = File.from_stream(fp)
os.remove(checkpoint_path)


I am not sure about this. If we are in async mode then is it possible that the file will be removed while the stream is still being read?

@AleksanderWWW (Contributor, author)


I don't think so. Reading ends when we exit the context manager, since that closes the stream. By the time you hit the os.remove line, the stream has already been copied to the .neptune location, from which it is uploaded asynchronously and then immediately deleted.
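A minimal, self-contained illustration of that claim, using a plain bytes read as a stand-in for File.from_stream (assumed here to consume the stream the same way before the assignment returns):

    import os
    import tempfile

    # Write a throwaway "checkpoint" file.
    with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
        tmp.write(b"checkpoint bytes")
        checkpoint_path = tmp.name

    # The contents are fully read inside the context manager, so removing the
    # file afterwards does not affect the data already held in memory.
    with open(checkpoint_path, "rb") as fp:
        payload = fp.read()
    os.remove(checkpoint_path)
    assert payload == b"checkpoint bytes"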


That makes sense then. Thank you!

@kshitij12345 left a comment


LGTM! Thank you @AleksanderWWW

(Though I would recommend a review from engineering as well :) )

README.md Outdated
@@ -1,4 +1,4 @@
# Neptune - detectron2 integration

TODO: Update docs link


Now you can remove the TODO ;)

pyproject.toml Outdated
@@ -16,11 +16,14 @@ importlib-metadata = { version = "*", python = "<3.8" }

# TODO: Base requirements


This TODO can be removed if the base requirements are satisfied.

img_dir = "./datasets/coco/train2014"
if not os.path.isdir(img_dir) or len(os.listdir(img_dir)) == 0:
    os.makedirs(img_dir, exist_ok=True)
    os.system("wget http://images.cocodataset.org/train2014/COCO_train2014_000000057870.jpg")


Why do we need images stored in the repo if they are downloaded dynamically?



If we don't need them, some gitignore entries might be a good idea.
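For instance, a hypothetical .gitignore entry, assuming the downloaded images stay under the datasets directory used above:

    # Ignore data that the example scripts download on the fly
    datasets/coco/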

@AleksanderWWW AleksanderWWW merged commit 4326b81 into main Dec 30, 2022
@AleksanderWWW AleksanderWWW deleted the aw/add-neptune-hook branch December 30, 2022 16:57