First commit to add video instance segmentation #27
base: main
Conversation
Hey @fardinayar, thank you for the PR. We will have a look at this asap. Do you have any suggestions for a small video dataset to use to test the conversion outside of the added tests?
Hi again, The main reason for this early PR is to check if the idea aligns with your repo and if it’s something you'd consider.
Overall the approach is great! I left a few suggestions. I'd need to get more into the video formats and tasks myself. I think my main concern is that we should make sure that we have a good structure for the data objects so we can easily support all the tasks (including tracking).
Let me know what you think :)
import cv2
import numpy as np
import pycocotools.mask as mask_utils
In the long run we would like to avoid using pycocotools, but for this PR it's fine. We can make a follow-up PR to address this. The same goes for OpenCV.
For this PR you'd also have to add cv2 and pycocotools to the poetry.toml file.
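As a hedged side note (not part of this PR): for uncompressed COCO-style RLE, a plain-NumPy decoder sketch like the one below could eventually replace the pycocotools dependency; compressed string-encoded counts would still need extra handling. The function name and scope here are illustrative assumptions.

import numpy as np

# Sketch only: decode an *uncompressed* COCO-style RLE dict
# ({"size": [h, w], "counts": [run lengths]}) without pycocotools.
# Compressed (string) counts are not handled here.
def decode_uncompressed_rle(rle: dict) -> np.ndarray:
    height, width = rle["size"]
    flat = np.zeros(height * width, dtype=np.uint8)
    position = 0
    value = 0  # runs alternate, starting with background (0)
    for run_length in rle["counts"]:
        flat[position:position + run_length] = value
        position += run_length
        value = 1 - value
    # COCO RLE stores pixels in column-major (Fortran) order.
    return flat.reshape((height, width), order="F")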
def _youtube_vis_segmentation_to_multipolygon( |
Let's move this to a new folder and file src/labelformat/utils/segmentation.py
return MultiPolygon(polygons=polygons)
def _mask_to_polygons(mask: np.ndarray) -> List[np.ndarray]: |
Also move this to utils/segmentation.py
return [contour.squeeze() for contour in contours if len(contour) >= 3]
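For context, a minimal sketch of what a mask-to-polygon helper along these lines might look like; the function body is collapsed in this view, so everything between the signature and the return statement below is an assumption, not the PR's actual code.

import cv2
import numpy as np
from typing import List

def _mask_to_polygons(mask: np.ndarray) -> List[np.ndarray]:
    # Find the outer contours of the binary mask (OpenCV >= 4 returns
    # (contours, hierarchy)).
    contours, _ = cv2.findContours(
        mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )
    # Keep only contours with at least 3 points, i.e. valid polygons.
    return [contour.squeeze() for contour in contours if len(contour) >= 3]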
def _multipolygon_to_youtube_vis_segmentation( |
Also move this to utils/segmentation.py
youtube_vis_segmentation.append(rle)
return youtube_vis_segmentation
def _get_output_videos_dict( |
Also move this to utils/segmentation.py
]
def _get_output_categories_dict( |
Also move this to utils/segmentation.py
@dataclass(frozen=True)
class SingleVideoInstanceSegmentation: |
If you have a bunch of SingleVideoInstanceSegmentation objects it would be hard to figure out how they are connected. Shouldn't we add a track id? Otherwise we would need to use the index in the list or so. I think adding a track id might be easier and less error-prone.
I'd also suggest adding the frame number to each of the objects.
I'd suggest something like this instead:
@dataclass(frozen=True)
class FrameSegmentation:
    frame: int
    segmentation: MultiPolygon

@dataclass(frozen=True)
class SingleVideoInstanceSegmentation:
    track_id: int
    category: Category
    frame_segmentations: List[FrameSegmentation]
Yes, I expected something similar when I first used YouTube VIS (YVIS) datasets. However, YVIS uses an approach without track IDs. More specifically, each video in YVIS has a list of annotations for each object with the size 'video_length'. This means that if an object disappears in a frame, its corresponding segmentation is saved as 'null'.
Since the most popular VIS dataset is YVIS, and many other VIS datasets like OVIS follow the exact same pattern, I thought it was a good idea. It also follows the same logic as 'SingleInstanceSegmentation' and 'ImageInstanceSegmentation', so we can assume masks in VIS are 3D, with the extra dimension being time.
Anyway, some other datasets, like KITTI-MOTS, use a track-id format similar to what you have proposed.
Let me also implement KITTI-MOTS, and we can reconsider which format is better.
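To illustrate the layout described above, here is a hypothetical YVIS-style annotation entry (field names and values are illustrative and may differ from the real dataset files): one entry per object, with a per-frame segmentation list of length video_length and null/None wherever the object is not visible.

# Hypothetical YVIS-style annotation entry (illustrative only).
annotation = {
    "id": 1,
    "video_id": 3,
    "category_id": 7,
    "segmentations": [
        {"size": [720, 1280], "counts": "..."},  # frame 0: RLE mask
        None,                                    # frame 1: object not visible
        {"size": [720, 1280], "counts": "..."},  # frame 2: RLE mask
    ],
}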
Ok, thanks a lot for the explanation. I guess the important point is then what we will use as the internal format.
We designed labelformat to load and write based on an internal representation. Based on what you describe we could pick either one, YVIS or KITTI-MOTS.
I would personally prefer to use something well-defined as an internal representation and then when we load a dataset we convert to that format and might have to make assumptions.
For example, for object detection this internal format is composed of:
@dataclass(frozen=True)
class ImageObjectDetection:
    image: Image
    objects: List[SingleObjectDetection]

Which is composed of:

@dataclass(frozen=True)
class Image:
    id: int
    filename: str
    width: int
    height: int

@dataclass(frozen=True)
class SingleObjectDetection:
    category: Category
    box: BoundingBox

@dataclass(frozen=True)
class BoundingBox:
    xmin: float
    ymin: float
    xmax: float
    ymax: float
So if we load YOLO format datasets we would need to extract the image width and height as well.
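(As a rough illustration of that point, and not labelformat's actual loader: YOLO stores boxes as normalized center/size values, so absolute corner coordinates can only be computed once the image dimensions are read. The helper name below is hypothetical.)

from pathlib import Path
from typing import Tuple
from PIL import Image as PILImage

# Illustrative only: convert one normalized YOLO label line into an
# absolute (xmin, ymin, xmax, ymax) box using the image dimensions.
def yolo_line_to_absolute_box(
    label_line: str, image_path: Path
) -> Tuple[float, float, float, float]:
    width, height = PILImage.open(image_path).size
    _category, xc, yc, w, h = (float(value) for value in label_line.split())
    xmin = (xc - w / 2) * width
    ymin = (yc - h / 2) * height
    xmax = (xc + w / 2) * width
    ymax = (yc + h / 2) * height
    return xmin, ymin, xmax, ymax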
Now for video datasets we just need to find a suitable internal representation. The rest should then be straightforward. It's also not a big deal if we have to change this in the future. As I'm not familiar with the formats, I'll trust you with the pick :)
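For illustration only, one possible shape for such a video-level internal representation, mirroring the Image/SingleObjectDetection split above and the YVIS-style per-frame list discussed earlier; every name and field below is an assumption, not a decision made in this thread.

from dataclasses import dataclass
from typing import List, Optional

# "Category" and "MultiPolygon" refer to labelformat's existing data types;
# their import paths are omitted because they are not shown in this thread.

@dataclass(frozen=True)
class Video:
    id: int
    filenames: List[str]  # one frame filename per time step
    width: int
    height: int

@dataclass(frozen=True)
class SingleVideoInstanceSegmentation:
    category: "Category"
    # One entry per frame; None where the object is not visible (YVIS style).
    segmentations: List[Optional["MultiPolygon"]]

@dataclass(frozen=True)
class VideoInstanceSegmentation:
    video: Video
    objects: List[SingleVideoInstanceSegmentation]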
obj1[key], obj2[key], rel=rel, abs=abs, nan_ok=nan_ok
)
if 'counts' in obj1:  # For RLE-encoded segmentations
    import pycocotools.mask as mask_utils
You could pack this into a separate method:
def _compare_rle_masks(mask1: dict, mask2: dict, tolerance: int = 5) -> None:
    import pycocotools.mask as mask_utils

    arr1 = mask_utils.decode(mask1)
    arr2 = mask_utils.decode(mask2)
    assert arr1.shape == arr2.shape, "RLE masks have different shapes"
    difference = np.abs(arr1 - arr2).sum()
    assert difference <= tolerance, f"RLE masks differ beyond tolerance: {difference} > {tolerance}"
import json |
I suggest adding a helper function to integration/integration_utils.py that normalizes the JSON structure for comparison.
Something like this:
import copy

def normalize_json(data: dict, schema: str = "vis") -> dict:
    normalized = copy.deepcopy(data)
    if schema == "vis":
        normalized.pop("info", None)
        normalized.pop("licenses", None)
        for category in normalized.get("categories", []):
            category.pop("supercategory", None)
        for video in normalized.get("videos", []):
            video.pop("license", None)
        for annotation in normalized.get("annotations", []):
            for key in ["areas", "bboxes", "length", "occlusion"]:
                annotation.pop(key, None)
    elif schema == "coco":
        normalized.pop("info", None)
        normalized.pop("licenses", None)
        for category in normalized.get("categories", []):
            category.pop("supercategory", None)
        for image in normalized.get("images", []):
            for key in ["date_captured", "license", "flickr_url", "coco_url"]:
                image.pop(key, None)
        for annotation in normalized.get("annotations", []):
            for key in ["id", "area"]:
                annotation.pop(key, None)
    return normalized
Then in this file you can use the helper for easier testing. AFAIK we want to check only the important parts and ignore the rest.
assert output_data["annotations"]  # Ensure annotations are not empty
def test_youtube_vis_to_youtube_vis(tmp_path: Path) -> None: |
If you add the helper described earlier you could turn this into:
def test_youtube_vis_to_youtube_vis(tmp_path: Path) -> None:
    label_input = YouTubeVISInput(input_file=REAL_DATA_FILE)
    output_file = tmp_path / "annotations_train.json"
    YouTubeVISOutput(output_file=output_file).save(label_input=label_input)

    output_json = json.loads(output_file.read_text())
    expected_json = json.loads(REAL_DATA_FILE.read_text())

    normalized_output = normalize_json(output_json, schema="vis")
    normalized_expected = normalize_json(expected_json, schema="vis")
    assert_almost_equal_recursive(normalized_output, normalized_expected)
Thank you for your detailed feedback. I will make sure to apply all of the suggestions and get back to you soon.
Hi,
Thank you for your clean and well-organized code!
I'm adding a new feature to support video instance segmentation, as it's necessary for one of my projects, and I would be honored to contribute to this project.
My goals are:
You can view the current status in my forked repository.
If you believe I can merge this into the main repository in the future, would you create a new branch for "VIS"?