ScribbleBox: Interactive Annotation Framework for Video Object Segmentation

Bowen Chen* 1
Huan Ling* 1,2,3
Xiaohui Zeng1,2
Jun Gao1,2,3
Ziyue Xu1
Sanja Fidler1,2,3

1University of Toronto
2Vector Institute
3NVIDIA
ECCV, 2020



Manually labeling video datasets for segmentation tasks is extremely time-consuming. In this paper, we introduce ScribbleBox, a novel interactive framework for annotating object instances with masks in videos. In particular, we split annotation into two steps: annotating objects with tracked boxes, and labeling masks inside these tracks. We introduce automation and interaction in both steps. Box tracks are annotated efficiently by approximating the trajectory using a parametric curve with a small number of control points which the annotator can interactively correct. Our approach tolerates a modest amount of noise in the box placements, so typically only a few clicks are needed to annotate tracked boxes to sufficient accuracy. Segmentation masks are corrected via scribbles which are efficiently propagated through time. We show significant gains in annotation efficiency over past work: our ScribbleBox approach reaches 88.92% J&F on DAVIS2017 with 9.14 clicks per box track and 4 frames of scribble annotation.
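
To make the box-track step concrete, below is a minimal sketch (in Python with NumPy/SciPy) of the general idea: a box trajectory is summarized by a few control frames, a parametric curve is fit through them, and editing a single control box re-fits the curve. The function names (fit_track, eval_track, correct_control_box), the choice of cubic splines, and the uniform control-point placement are illustrative assumptions, not the exact parameterization used in the paper.

```python
# Minimal sketch of curve-based box-track annotation (illustrative only;
# the curve family and control-point selection are assumptions, not the
# exact parameterization used in the paper).
import numpy as np
from scipy.interpolate import CubicSpline

def fit_track(frames, boxes, num_ctrl=6):
    """Fit one cubic spline per box coordinate (x, y, w, h) through
    num_ctrl control frames spaced uniformly along the track."""
    frames = np.asarray(frames, dtype=float)          # (T,)
    boxes = np.asarray(boxes, dtype=float)            # (T, 4)
    ctrl = np.linspace(0, len(frames) - 1, num_ctrl).round().astype(int)
    spline = CubicSpline(frames[ctrl], boxes[ctrl])   # vector-valued spline
    return ctrl, spline

def eval_track(spline, frames):
    """Reconstruct a box for every frame from the fitted curve."""
    return spline(np.asarray(frames, dtype=float))    # (T, 4)

def correct_control_box(frames, boxes, ctrl, frame_id, new_box):
    """The annotator drags the box at a control frame; re-fitting the curve
    lets a single correction update the trajectory around it."""
    boxes = np.array(boxes, dtype=float)
    boxes[frame_id] = new_box
    return CubicSpline(np.asarray(frames, dtype=float)[ctrl], boxes[ctrl])

# Usage: fit a noisy tracked trajectory, correct one control box, re-evaluate.
T = 100
frames = np.arange(T)
noisy_boxes = np.stack([frames * 2.0, frames * 1.5,
                        np.full(T, 50.0), np.full(T, 80.0)], axis=1)
noisy_boxes += np.random.randn(T, 4)                  # simulated tracker noise
ctrl, spline = fit_track(frames, noisy_boxes)
smooth_boxes = eval_track(spline, frames)
spline = correct_control_box(frames, noisy_boxes, ctrl, ctrl[2], [40, 30, 52, 82])
corrected_boxes = eval_track(spline, frames)
```

A correction at one control frame propagates smoothly to neighbouring frames through the curve, which is why only a few clicks are typically needed per track.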


Paper

Bowen Chen*, Huan Ling*, Xiaohui Zeng, Jun Gao,
Ziyue Xu, Sanja Fidler

ScribbleBox: Interactive Annotation Framework for
Video Object Segmentation
ECCV, 2020. (to appear)

[Preprint]
[Bibtex]
[Toronto Annotation Suite]


Results


Qualitative results on the DAVIS2017 val set using our full annotation framework. 5 rounds of scribble correction and 8.85 box corrections were used.

Qualitative results on MOTS-KITTI using our full annotation framework. 5 rounds of scribble correction and 11.2 box corrections were used.

Real annotation examples using our annotation tool in practice on the EPIC-Kitchens dataset.
Each object in a 100-frame video requires on average 67.1s of annotation time (including inference time). The first column of every two rows shows the target objects.


We gratefully acknowledge NVIDIA Corporation for the donation of GPUs used for this research. This webpage template was borrowed from Richard Zhang.