Motion Expressions Guided Video Segmentation Via Effective Motion Information Mining
Published in IEEE Transactions on Emerging Topics in Computational Intelligence, 2025
Motion expressions guided video segmentation aims to segment objects in videos according to given language descriptions of object motion. To accurately segment moving objects across frames, it is important to capture motion information of objects over the entire video. However, existing methods fail to encode object motion information accurately. In this paper, we propose an effective motion information mining framework, named EMIM, to improve motion expressions guided video segmentation. It consists of two novel modules: a hierarchical motion aggregation module and a box-level positional encoding module. Specifically, the hierarchical motion aggregation module captures both local and global temporal information of objects within a video. To achieve this, we introduce local-window self-attention for short-term feature aggregation and selective state space models for long-term feature aggregation. Motivated by the observation that the spatial changes of objects across frames effectively reflect object motion, the box-level positional encoding module integrates object spatial information into object embeddings. With the two proposed modules, our method can capture how object locations evolve over time. We conduct extensive experiments on the motion expressions guided video segmentation dataset MeViS to demonstrate the advantages of EMIM. EMIM achieves a J&F score of 42.2%, outperforming the prior approach LMPM by 5.0%.
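The short-term aggregation idea in the abstract can be illustrated with a minimal sketch of local-window self-attention over per-frame object features. This is a simplified single-head version without learned projections, written only to show how attention is restricted to non-overlapping temporal windows; the actual module in the paper may differ in details such as window overlap, heads, and projection layers.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_window_self_attention(feats, window):
    """Self-attention restricted to non-overlapping temporal windows.

    feats: (T, C) array of per-frame object features.
    window: number of frames per local window.
    Hypothetical sketch: queries, keys, and values are the features
    themselves (no learned projections).
    """
    T, C = feats.shape
    out = np.zeros_like(feats)
    for start in range(0, T, window):
        w = feats[start:start + window]            # frames in this window
        attn = softmax(w @ w.T / np.sqrt(C))       # window-local attention weights
        out[start:start + window] = attn @ w       # aggregate within the window
    return out

# Example: 8 frames, 16-dim features, windows of 4 frames.
rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 16))
short_term = local_window_self_attention(feats, window=4)
```

Because attention never crosses a window boundary, a frame's output depends only on frames in its own window, which is what makes this a short-term (local) aggregation step.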
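The box-level positional encoding can likewise be sketched: each object's normalized box coordinates are mapped to a sinusoidal encoding and added to the object embedding, so that spatial changes across frames shift the embedding. The coordinate layout `(cx, cy, w, h)`, the equal split of dimensions per coordinate, and additive fusion are assumptions for illustration, not the paper's exact design.

```python
import numpy as np

def sinusoidal_encode(x, dim):
    """Transformer-style sinusoidal encoding of a scalar in [0, 1]."""
    half = dim // 2
    freqs = 1.0 / (10000 ** (np.arange(half) / half))
    angles = x * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def box_positional_encoding(box, embed_dim):
    """Encode a normalized box (cx, cy, w, h) into an embed_dim vector.

    Hypothetical sketch: each of the four coordinates gets
    embed_dim // 4 features, concatenated in order.
    """
    per_coord = embed_dim // 4
    return np.concatenate([sinusoidal_encode(c, per_coord) for c in box])

# Fuse the box encoding into a per-frame object embedding (additive fusion assumed).
embed_dim = 256
object_embed = np.zeros(embed_dim)        # placeholder object feature
box_t = (0.5, 0.5, 0.2, 0.3)              # normalized (cx, cy, w, h) at frame t
fused = object_embed + box_positional_encoding(box_t, embed_dim)
```

Since the encoding is a deterministic function of the box, two frames in which the object occupies different locations produce different fused embeddings, exposing the spatial change to later temporal aggregation.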
Recommended citation: Ge Li, Hanqing Sun, Aiping Yang, Jiale Cao, and Yanwei Pang. Motion Expressions Guided Video Segmentation Via Effective Motion Information Mining. IEEE Transactions on Emerging Topics in Computational Intelligence, 2024.
Download Paper