First commit to add video instance segmentation #27

Draft · wants to merge 1 commit into main

Conversation

fardinayar

Hi,
Thank you for your clean and well-organized code!
I'm adding a new feature to support video instance segmentation, as it's necessary for one of my projects, and I would be honored to contribute to this project.
My goals are:

  1. Add conversion between VIS datasets like YouTube VIS, KITTI MOTS, and others.
  2. Add conversion from VIS datasets to image datasets.
  3. Follow the general principles of your code and write unit and integration tests.

You can view the current status in my forked repository.
If you believe I can merge this into the main repository in the future, would you create a new branch for "VIS"?

@IgorSusmelj
Contributor

Hey @fardinayar, thank you for the PR. We will have a look at this ASAP. Do you have any suggestions for a small video dataset we could use to test the conversion outside of the added tests?

@fardinayar
Author

fardinayar commented Feb 10, 2025

Hi again,
Thanks for giving it a chance!
So far I have only tested it with datasets in the YouTube VIS format, but I plan to test it with the KITTI MOTS dataset, and I will send you a small version of both datasets.
For now, I'm still working on it, and there are a few things to do before it's fully ready. In the meantime, I would appreciate any suggestions on the implementation.

The main reason for this early PR is to check if the idea aligns with your repo and if it’s something you'd consider.

Contributor

@IgorSusmelj left a comment

Overall the approach is great! I left a few suggestions. I'd need to get more into the video formats and tasks myself. I think my main concern is that we should make sure that we have a good structure for the data objects so we can easily support all the tasks (including tracking).

Let me know what you think :)


import cv2
import numpy as np
import pycocotools.mask as mask_utils
Contributor

In the long run we would like to avoid using pycocotools, but for this PR it's fine; we can make a follow-up PR to address this. The same goes for OpenCV.

For this PR you'd also have to add cv2 and pycocotools as dependencies in the pyproject.toml file.




def _youtube_vis_segmentation_to_multipolygon(
Contributor

Let's move this to a new folder and file src/labelformat/utils/segmentation.py

return MultiPolygon(polygons=polygons)


def _mask_to_polygons(mask: np.ndarray) -> List[np.ndarray]:
Contributor

Also move this to utils/segmentation.py

return [contour.squeeze() for contour in contours if len(contour) >= 3]
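
For reference, here is a self-contained sketch of how this helper could look once moved to utils/segmentation.py. It assumes OpenCV 4's two-value findContours return and a binary uint8 mask, so treat it as an approximation rather than the PR's exact implementation:

from typing import List

import cv2
import numpy as np


def _mask_to_polygons(mask: np.ndarray) -> List[np.ndarray]:
    # Trace the external contours of the binary mask; each contour becomes one polygon.
    contours, _ = cv2.findContours(
        mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )
    # Drop degenerate contours: fewer than 3 points cannot form a polygon.
    return [contour.squeeze() for contour in contours if len(contour) >= 3]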


def _multipolygon_to_youtube_vis_segmentation(
Contributor

Also move this to utils/segmentation.py

youtube_vis_segmentation.append(rle)
return youtube_vis_segmentation

def _get_output_videos_dict(
Contributor

Also move this to utils/segmentation.py

]


def _get_output_categories_dict(
Contributor

Also move this to utils/segmentation.py



@dataclass(frozen=True)
class SingleVideoInstanceSegmentation:
Contributor

If you have a bunch of SingleVideoInstanceSegmentation objects it would be hard to figure out how they are connected. Shouldn't we add a track id? Otherwise we would need to use the index in the list or something similar. I think adding a track id might be easier and less error-prone.

I'd also suggest adding the frame number to each of the objects.

I'd suggest something like this instead:

@dataclass(frozen=True)
class FrameSegmentation:
    frame: int 
    segmentation: MultiPolygon

@dataclass(frozen=True)
class SingleVideoInstanceSegmentation:
    track_id: int
    category: Category
    frame_segmentations: List[FrameSegmentation]

Author

Yes, I expected something similar when I first used YouTube VIS (YVIS) datasets. However, YVIS uses an approach without track IDs. More specifically, each video in YVIS has a list of annotations for each object with length 'video_length'. This means that if an object disappears in a frame, its corresponding segmentation is saved as 'null'.
Since the most popular VIS dataset is YVIS, and many other VIS datasets like OVIS follow the exact same pattern, I thought it was a good idea. It also has the same logic as 'SingleInstanceSegmentation' and 'ImageInstanceSegmentation'. So, we can assume masks are 3D in VIS, with the extra dimension being time.
Anyway, some other datasets, like KITTI-MOTS, use a track-id format similar to what you have proposed.
Let me also implement KITTI-MOTS, and we can reconsider which format is better.
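
To make the contrast concrete, here is a hypothetical, heavily simplified sketch of the two layouts (field names are illustrative, not copied verbatim from either spec):

# Placeholder RLE dicts standing in for real per-frame masks.
rle_frame_0 = {"size": [720, 1280], "counts": "<rle>"}
rle_frame_2 = {"size": [720, 1280], "counts": "<rle>"}

# YVIS-style: one annotation per object with a per-frame list of length
# 'video_length'; null (None) marks frames where the object is absent.
yvis_style_annotation = {
    "video_id": 1,
    "category_id": 3,
    "segmentations": [rle_frame_0, None, rle_frame_2],
}

# Track-id style (KITTI MOTS-like): each entry carries an explicit track id and
# frame index, so frames where the object is absent are simply omitted.
track_id_style_annotations = [
    {"track_id": 7, "category_id": 3, "frame": 0, "segmentation": rle_frame_0},
    {"track_id": 7, "category_id": 3, "frame": 2, "segmentation": rle_frame_2},
]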

Contributor

Ok, thanks a lot for the explanation. I guess the important point is then what we use as the internal format.
We designed labelformat to load and write based on an internal representation. Based on what you describe we could pick either one, YVIS or KITTI-MOTS.
I would personally prefer to use something well-defined as the internal representation; when we load a dataset, we then convert to that format, making assumptions where necessary.

For example, for object detection this internal format is composed of:

@dataclass(frozen=True)
class ImageObjectDetection:
    image: Image
    objects: List[SingleObjectDetection]

Which is composed of

@dataclass(frozen=True)
class Image:
    id: int
    filename: str
    width: int
    height: int

@dataclass(frozen=True)
class SingleObjectDetection:
    category: Category
    box: BoundingBox

@dataclass(frozen=True)
class BoundingBox:
    xmin: float
    ymin: float
    xmax: float
    ymax: float

So if we load YOLO format datasets we would need to extract the image width and height as well.

Now for video datasets we just need to find a suitable internal representation. The rest should then be straightforward. It's also not a big deal if we have to change this in the future. As I'm not familiar with the formats, I'll trust your pick :)
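
For illustration only, one possible video-level counterpart to the image-level classes above, combining per-video metadata with the track-id structure sketched earlier. The names and fields are hypothetical, not a final design:

from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Video:
    # Video-level metadata, analogous to the Image dataclass above.
    id: int
    filenames: List[str]  # one frame image per entry, in temporal order
    width: int
    height: int


@dataclass(frozen=True)
class FrameSegmentation:
    frame: int
    segmentation: "MultiPolygon"  # would reuse the existing MultiPolygon model


@dataclass(frozen=True)
class SingleVideoInstanceSegmentation:
    track_id: int
    category: "Category"  # would reuse the existing Category model
    frame_segmentations: List[FrameSegmentation]


@dataclass(frozen=True)
class VideoInstanceSegmentation:
    video: Video
    objects: List[SingleVideoInstanceSegmentation]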

obj1[key], obj2[key], rel=rel, abs=abs, nan_ok=nan_ok
)
if 'counts' in obj1:  # For RLE-encoded segmentations
import pycocotools.mask as mask_utils
Contributor

You could pack this into a separate method:

def _compare_rle_masks(mask1: dict, mask2: dict, tolerance: int = 5) -> None:
    import numpy as np
    import pycocotools.mask as mask_utils

    arr1 = mask_utils.decode(mask1)
    arr2 = mask_utils.decode(mask2)
    assert arr1.shape == arr2.shape, "RLE masks have different shapes"
    # Cast to a signed integer type before subtracting to avoid uint8 wrap-around.
    difference = np.abs(arr1.astype(int) - arr2.astype(int)).sum()
    assert difference <= tolerance, f"RLE masks differ beyond tolerance: {difference} > {tolerance}"
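
The 'counts' branch in the diff above could then simply delegate to the helper, roughly like this (sketch only):

if "counts" in obj1:  # RLE-encoded segmentations
    _compare_rle_masks(obj1, obj2)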

@@ -0,0 +1,81 @@
import json
Contributor

I suggest adding a helper function to integration/integration_utils.py that helps normalize the JSON structure for comparison.

Something like this:

import copy

def normalize_json(data: dict, schema: str = "vis") -> dict:
    normalized = copy.deepcopy(data)
    if schema == "vis":
        normalized.pop("info", None)
        normalized.pop("licenses", None)
        for category in normalized.get("categories", []):
            category.pop("supercategory", None)
        for video in normalized.get("videos", []):
            video.pop("license", None)
        for annotation in normalized.get("annotations", []):
            for key in ["areas", "bboxes", "length", "occlusion"]:
                annotation.pop(key, None)
    elif schema == "coco":
        normalized.pop("info", None)
        normalized.pop("licenses", None)
        for category in normalized.get("categories", []):
            category.pop("supercategory", None)
        for image in normalized.get("images", []):
            for key in ["date_captured", "license", "flickr_url", "coco_url"]:
                image.pop(key, None)
        for annotation in normalized.get("annotations", []):
            for key in ["id", "area"]:
                annotation.pop(key, None)
    return normalized

Then in this file you use the helper for easier testing. AFAIK we want to check only the important parts and ignore the rest.

assert output_data["annotations"] # Ensure annotations are not empty


def test_youtube_vis_to_youtube_vis(tmp_path: Path) -> None:
Contributor

If you add the helper described earlier you could turn this into:

def test_youtube_vis_to_youtube_vis(tmp_path: Path) -> None:
    label_input = YouTubeVISInput(input_file=REAL_DATA_FILE)
    output_file = tmp_path / "annotations_train.json"
    YouTubeVISOutput(output_file=output_file).save(label_input=label_input)

    output_json = json.loads(output_file.read_text())
    expected_json = json.loads(REAL_DATA_FILE.read_text())
    
    normalized_output = normalize_json(output_json, schema="vis")
    normalized_expected = normalize_json(expected_json, schema="vis")
    
    assert_almost_equal_recursive(normalized_output, normalized_expected)

@fardinayar
Author

Thank you for your detailed feedback. I will make sure to apply all of your suggestions and get back to you soon.
