Multispectral remote sensing object detection via selective cross-modal interaction and aggregation

Published in Neural Networks, 2025

Multispectral remote sensing object detection plays a vital role in a wide range of geoscience and remote sensing applications, such as environmental monitoring and disaster monitoring, by leveraging complementary information from RGB and infrared modalities. The performance of such systems heavily relies on the effective fusion of information across these modalities. A key challenge lies in capturing meaningful cross-modal long-range dependencies that aid object localization and identification, while simultaneously suppressing noise and irrelevant information during feature fusion to enhance the discriminative quality of the fused representations. To address these challenges, we propose a novel framework, termed Selective cross-modal Interaction and Aggregation (SIA), which comprises two key components: the Selective Cross-modal Interaction (SCI) module and the Selective Feature Aggregation (SFA) module. The SCI module addresses the inefficiency of traditional cross-modal attention mechanisms by selectively prioritizing the most informative long-range dependencies. This significantly reduces computational costs while maintaining high detection accuracy. The SFA module utilizes a gating mechanism to effectively filter out noise and redundant information introduced by equal-weight fusion, thereby yielding more discriminative feature representations. Comprehensive experiments are conducted on the challenging multispectral remote sensing object detection benchmark DroneVehicle, as well as two additional multispectral urban object detection datasets, M3FD and LLVIP. The proposed approach consistently achieves superior detection accuracy across all datasets. Notably, on the DroneVehicle test set, our method outperforms the recently introduced C2Former by 2.8% mAP@0.5, while incurring lower computational cost.

Recommended citation: Cui Minghao, Nie Jing, Sun Hanqing, Xie Jin, Cao Jiale, Pang Yanwei, Li Xuelong. Multispectral remote sensing object detection via selective cross-modal interaction and aggregation. Neural Networks, 2025: 108533.
Download Paper