First commit to add video instance segmentation #27

Draft · wants to merge 1 commit into main

Conversation

fardinayar

Hi,
Thank you for your clean and well-organized code!
I'm adding a new feature to support video instance segmentation, as it's necessary for one of my projects, and I would be honored to contribute to this project.
My goals are:

  1. Add conversion between VIS datasets like YouTube VIS, KITTI MOTS, and others.
  2. Add conversion from VIS datasets to image datasets.
  3. Follow the general principles of your code and write unit and integration tests.

You can view the current status in my forked repository.
If you believe I can merge this into the main repository in the future, would you create a new branch for "VIS"?

@IgorSusmelj
Contributor

Hey @fardinayar, thank you for the PR. We will have a look at this ASAP. Do you have any suggestions for a small video dataset we could use to test the conversion outside of the added tests?

@fardinayar
Author

fardinayar commented Feb 10, 2025

Hi again,
Thanks for giving it a chance!
So far I have only tested it with datasets in the YouTube VIS format, but I plan to test it with the KITTI MOTS dataset, and I will send you a small version of both datasets.
For now, I'm still working on it, and there are a few things to do before it's fully ready. In the meantime, I would appreciate any suggestions on the implementation.

The main reason for this early PR is to check if the idea aligns with your repo and if it’s something you'd consider.

Contributor

@IgorSusmelj left a comment

Overall the approach is great! I left a few suggestions. I'd need to get more into the video formats and tasks myself. I think my main concern is that we should make sure that we have a good structure for the data objects so we can easily support all the tasks (including tracking).

Let me know what you think :)


import cv2
import numpy as np
import pycocotools.mask as mask_utils
Contributor

In the long run we would like to avoid using pycocotools, but for this PR it's fine; we can make a follow-up PR to address this. The same goes for OpenCV.

For this PR you'd also have to add cv2 and pycocotools as dependencies in the pyproject.toml file.




def _youtube_vis_segmentation_to_multipolygon(
Contributor

Let's move this to a new folder and file src/labelformat/utils/segmentation.py

return MultiPolygon(polygons=polygons)


def _mask_to_polygons(mask: np.ndarray) -> List[np.ndarray]:
Contributor

Also move this to utils/segmentation.py

return [contour.squeeze() for contour in contours if len(contour) >= 3]
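
For reference, here is a self-contained sketch of how this helper could look once moved to utils/segmentation.py. It assumes OpenCV 4's two-value findContours return and a binary uint8 mask, so treat it as an approximation rather than the PR's exact implementation:

from typing import List

import cv2
import numpy as np


def _mask_to_polygons(mask: np.ndarray) -> List[np.ndarray]:
    # Trace the external contours of the binary mask; each contour becomes one polygon.
    contours, _ = cv2.findContours(
        mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )
    # Drop degenerate contours: fewer than 3 points cannot form a polygon.
    return [contour.squeeze() for contour in contours if len(contour) >= 3]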


def _multipolygon_to_youtube_vis_segmentation(
Contributor

Also move this to utils/segmentation.py

youtube_vis_segmentation.append(rle)
return youtube_vis_segmentation

def _get_output_videos_dict(
Contributor

Also move this to utils/segmentation.py

]


def _get_output_categories_dict(
Contributor

Also move this to utils/segmentation.py



@dataclass(frozen=True)
class SingleVideoInstanceSegmentation:
Contributor

If you have a bunch of SingleVideoInstanceSegmentation objects it would be hard to figure out how they are connected. Shouldn't we add a track id? Otherwise we would need to use the index in the list or something similar. I think adding a track id might be easier and less error-prone.

I'd also suggest adding the frame number to each of the objects.

I'd suggest something like this instead:

@dataclass(frozen=True)
class FrameSegmentation:
    frame: int 
    segmentation: MultiPolygon

@dataclass(frozen=True)
class SingleVideoInstanceSegmentation:
    track_id: int
    category: Category
    frame_segmentations: List[FrameSegmentation]

Author

Yes, I expected something similar when I first used YouTube VIS (YVIS) datasets. However, YVIS uses an approach without track IDs. More specifically, each video in YVIS has a list of annotations for each object with length 'video_length'. This means that if an object disappears in a frame, its corresponding segmentation is saved as 'null'.
Since the most popular VIS dataset is YVIS, and many other VIS datasets like OVIS follow the exact same pattern, I thought it was a good idea. It also has the same logic as 'SingleInstanceSegmentation' and 'ImageInstanceSegmentation'. So, we can assume masks are 3D in VIS, with the extra dimension being time.
Anyway, some other datasets, like KITTI-MOTS, use a track-id format similar to what you have proposed.
Let me also implement KITTI-MOTS, and we can reconsider which format is better.
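
To make the contrast concrete, here is a hypothetical, heavily simplified sketch of the two layouts (field names are illustrative, not copied verbatim from either spec):

# Placeholder RLE dicts standing in for real per-frame masks.
rle_frame_0 = {"size": [720, 1280], "counts": "<rle>"}
rle_frame_2 = {"size": [720, 1280], "counts": "<rle>"}

# YVIS-style: one annotation per object with a per-frame list of length
# 'video_length'; null (None) marks frames where the object is absent.
yvis_style_annotation = {
    "video_id": 1,
    "category_id": 3,
    "segmentations": [rle_frame_0, None, rle_frame_2],
}

# Track-id style (KITTI MOTS-like): each entry carries an explicit track id and
# frame index, so frames where the object is absent are simply omitted.
track_id_style_annotations = [
    {"track_id": 7, "category_id": 3, "frame": 0, "segmentation": rle_frame_0},
    {"track_id": 7, "category_id": 3, "frame": 2, "segmentation": rle_frame_2},
]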

Contributor

Ok, thanks a lot for the explanation. I guess the important point is then what we use as the internal format.
We designed labelformat to load and write based on an internal representation. Based on what you describe we could pick either one, YVIS or KITTI-MOTS.
I would personally prefer to use something well-defined as the internal representation; when we load a dataset, we then convert to that format, making assumptions where necessary.

For example, for object detection this internal format is composed of:

@dataclass(frozen=True)
class ImageObjectDetection:
    image: Image
    objects: List[SingleObjectDetection]

Which is composed of

@dataclass(frozen=True)
class Image:
    id: int
    filename: str
    width: int
    height: int

@dataclass(frozen=True)
class SingleObjectDetection:
    category: Category
    box: BoundingBox

@dataclass(frozen=True)
class BoundingBox:
    xmin: float
    ymin: float
    xmax: float
    ymax: float

So if we load YOLO format datasets we would need to extract the image width and height as well.

Now for video datasets we just need to find a suitable internal representation. The rest should then be straightforward. It's also not a big deal if we have to change this in the future. As I'm not familiar with the formats, I'll trust your pick :)
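
For illustration only, one possible video-level counterpart to the image-level classes above, combining per-video metadata with the track-id structure sketched earlier. The names and fields are hypothetical, not a final design:

from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Video:
    # Video-level metadata, analogous to the Image dataclass above.
    id: int
    filenames: List[str]  # one frame image per entry, in temporal order
    width: int
    height: int


@dataclass(frozen=True)
class FrameSegmentation:
    frame: int
    segmentation: "MultiPolygon"  # would reuse the existing MultiPolygon model


@dataclass(frozen=True)
class SingleVideoInstanceSegmentation:
    track_id: int
    category: "Category"  # would reuse the existing Category model
    frame_segmentations: List[FrameSegmentation]


@dataclass(frozen=True)
class VideoInstanceSegmentation:
    video: Video
    objects: List[SingleVideoInstanceSegmentation]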

obj1[key], obj2[key], rel=rel, abs=abs, nan_ok=nan_ok
)
if 'counts' in obj1:  # For RLE-encoded segmentations
import pycocotools.mask as mask_utils
Contributor

You could pack this into a separate method:

def _compare_rle_masks(mask1: dict, mask2: dict, tolerance: int = 5) -> None:
    import numpy as np
    import pycocotools.mask as mask_utils

    arr1 = mask_utils.decode(mask1)
    arr2 = mask_utils.decode(mask2)
    assert arr1.shape == arr2.shape, "RLE masks have different shapes"
    # Cast to a signed integer type before subtracting to avoid uint8 wrap-around.
    difference = np.abs(arr1.astype(int) - arr2.astype(int)).sum()
    assert difference <= tolerance, f"RLE masks differ beyond tolerance: {difference} > {tolerance}"
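
The 'counts' branch in the diff above could then simply delegate to the helper, roughly like this (sketch only):

if "counts" in obj1:  # RLE-encoded segmentations
    _compare_rle_masks(obj1, obj2)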

@@ -0,0 +1,81 @@
import json
Contributor

I suggest adding a helper function to integration/integration_utils.py that helps normalize the JSON structure for comparison.

Something like this:

import copy

def normalize_json(data: dict, schema: str = "vis") -> dict:
    normalized = copy.deepcopy(data)
    if schema == "vis":
        normalized.pop("info", None)
        normalized.pop("licenses", None)
        for category in normalized.get("categories", []):
            category.pop("supercategory", None)
        for video in normalized.get("videos", []):
            video.pop("license", None)
        for annotation in normalized.get("annotations", []):
            for key in ["areas", "bboxes", "length", "occlusion"]:
                annotation.pop(key, None)
    elif schema == "coco":
        normalized.pop("info", None)
        normalized.pop("licenses", None)
        for category in normalized.get("categories", []):
            category.pop("supercategory", None)
        for image in normalized.get("images", []):
            for key in ["date_captured", "license", "flickr_url", "coco_url"]:
                image.pop(key, None)
        for annotation in normalized.get("annotations", []):
            for key in ["id", "area"]:
                annotation.pop(key, None)
    return normalized

Then in this file you use the helper for easier testing. AFAIK we want to check only the important parts and ignore the rest.

assert output_data["annotations"] # Ensure annotations are not empty


def test_youtube_vis_to_youtube_vis(tmp_path: Path) -> None:
Contributor

If you add the helper described earlier you could turn this into:

def test_youtube_vis_to_youtube_vis(tmp_path: Path) -> None:
    label_input = YouTubeVISInput(input_file=REAL_DATA_FILE)
    output_file = tmp_path / "annotations_train.json"
    YouTubeVISOutput(output_file=output_file).save(label_input=label_input)

    output_json = json.loads(output_file.read_text())
    expected_json = json.loads(REAL_DATA_FILE.read_text())
    
    normalized_output = normalize_json(output_json, schema="vis")
    normalized_expected = normalize_json(expected_json, schema="vis")
    
    assert_almost_equal_recursive(normalized_output, normalized_expected)

@fardinayar
Author

Thank you for your detailed feedback. I will make sure to apply all of your suggestions and get back to you soon.
