Video instance segmentation without using mask and identity supervision
Published in IEEE Transactions on Multimedia, 2024
Video instance segmentation (VIS) is a challenging vision problem in which the task is to simultaneously detect, segment, and track all the object instances in a video. Most existing VIS approaches rely on pixel-level mask supervision within a frame as well as instance-level identity annotation across frames. However, obtaining these 'mask and identity' annotations is time-consuming and expensive. We propose the first mask-identity-free VIS framework that neither utilizes mask annotations nor requires identity supervision. Accordingly, we introduce a query contrast and exchange network (QCEN) comprising instance query contrast and query-exchanged mask learning. The instance query contrast first performs cross-frame instance matching and then conducts query feature contrastive learning. The query-exchanged mask learning exploits both intra-video and inter-video query exchange properties: exchanging queries of an identical instance from different frames within a video results in consistent instance masks, whereas exchanging queries across videos results in all-zero background masks. Extensive experiments on three benchmarks (YouTube-VIS 2019, YouTube-VIS 2021, and OVIS) reveal the merits of the proposed approach, which significantly reduces the performance gap between the identity-free baseline and our mask-identity-free VIS method. On the YouTube-VIS 2019 validation set, our mask-identity-free approach achieves 91.4% of the stronger-supervision-based baseline performance when utilizing the same ImageNet pre-trained model.
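The two ideas named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' QCEN implementation. The function names, the greedy cosine-similarity matching, the InfoNCE-style loss, the temperature value, and the dot-product mask head are all assumptions chosen for illustration; the abstract does not specify any of them.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def match_queries(q_a, q_b):
    # Assumed matching rule: each query in frame A is paired with its most
    # similar query in frame B (greedy argmax, not necessarily one-to-one).
    sim = l2_normalize(q_a) @ l2_normalize(q_b).T
    return sim.argmax(axis=1)

def query_contrast_loss(q_a, q_b, match, temperature=0.1):
    # Assumed InfoNCE-style loss: the matched query is the positive,
    # every other query in frame B is a negative.
    logits = l2_normalize(q_a) @ l2_normalize(q_b).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(q_a)), match].mean()

def predict_masks(queries, pixel_feats):
    # Assumed mask head: sigmoid of the dot product between each query (N, C)
    # and the per-pixel features (C, H, W).
    logits = np.einsum("nc,chw->nhw", queries, pixel_feats)
    return 1.0 / (1.0 + np.exp(-logits))

def exchange_targets(own_masks, same_video):
    # Intra-video exchange: masks should stay consistent with the frame's own
    # masks. Inter-video exchange: masks should be all-zero background.
    return own_masks if same_video else np.zeros_like(own_masks)

# Toy data: frame B holds the same three instances as frame A, shuffled.
rng = np.random.default_rng(0)
q_a = rng.normal(size=(3, 8))
perm = np.array([2, 0, 1])
q_b = q_a[perm] + 0.01 * rng.normal(size=(3, 8))
feats = rng.normal(size=(8, 4, 4))

match = match_queries(q_a, q_b)
loss = query_contrast_loss(q_a, q_b, match)
own = predict_masks(q_a, feats)
intra_target = exchange_targets(own, same_video=True)
inter_target = exchange_targets(own, same_video=False)
```

In this toy setup the matched indices serve as pseudo identity labels for the contrastive term, and the exchange targets stand in for the consistency and all-zero constraints described above. A real system would obtain queries and pixel features from a trained network rather than random draws.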
Recommended citation: Li Ge, Cao Jiale, Sun Hanqing, Anwer Rao M., Xie Jin, Khan Fahad, Pang Yanwei. Video instance segmentation without using mask and identity supervision. IEEE Transactions on Multimedia, 2025, 27: 224-235.