Advancing Segment Anything Model for Efficient Salient Object Detection in Remote Sensing Images

Salient object detection in optical remote sensing images (ORSI-SOD) often relies on leveraging pretrained knowledge from natural images to achieve high accuracy with limited training data. Traditional methods typically employ vision backbones [e.g., convolutional neural networks (CNNs) or vision transformers (ViTs)] pretrained on ImageNet to extract features from ORSI scenes. However, these backbones generalize less well across diverse scenarios than recent vision foundation models. To this end, we propose ORSI-SAM, a novel ORSI-SOD framework built on the Segment Anything Model (SAM), leveraging its superior generalization to achieve an exceptional efficiency–accuracy tradeoff. Specifically, ORSI-SAM adopts a lightweight SAM as its backbone, reducing parameter count and computational overhead to enable efficient deployment on satellite devices while retaining the rich knowledge learned from large-scale natural image datasets. Because user prompts are unavailable in ORSI-SOD, which degrades the prediction capability of the SAM decoder, we introduce a hierarchical interaction prompt generator (HIPG) that aggregates hierarchical features and generates mask prompts tailored to salient objects, guiding the decoder to produce high-quality saliency maps. Furthermore, to address the recognition challenges posed by the inherent characteristics of ORSIs, we propose a semantic-aware refinement decoder (SARD). SARD integrates structural details from low-level features to enrich fine-grained object information while using high-level features to suppress redundant interference in shallow layers, thereby sharpening detail in the predicted saliency map. ORSI-SAM is the first work to explore the accuracy–efficiency tradeoff for ORSI-SOD based on a SAM architecture.
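To make the prompt-generation idea concrete, the following is a minimal, speculative sketch of what a hierarchical prompt generator might compute: multi-scale backbone features are upsampled to a common resolution, concatenated, and projected into a single-channel mask prompt. All function names, shapes, and the simple channel-weighted projection are illustrative assumptions, not the paper's actual HIPG implementation.

```python
import numpy as np

def upsample_nearest(x, factor):
    # Nearest-neighbour upsampling of a (C, H, W) feature map (illustrative).
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def hierarchical_prompt(features, weights):
    # Stand-in for HIPG: align multi-scale features to the finest
    # resolution, concatenate along channels, and project to one channel.
    target_h = max(f.shape[1] for f in features)
    aligned = [upsample_nearest(f, target_h // f.shape[1]) for f in features]
    stacked = np.concatenate(aligned, axis=0)           # (sum C_i, H, W)
    # 1x1 "convolution": weighted sum over channels, then a sigmoid
    logits = np.tensordot(weights, stacked, axes=([0], [0]))
    return 1.0 / (1.0 + np.exp(-logits))                # (H, W), values in (0, 1)

# Toy three-level feature pyramid at strides 1, 2, and 4 (hypothetical sizes).
feats = [np.random.rand(4, 32, 32), np.random.rand(8, 16, 16), np.random.rand(16, 8, 8)]
w = np.random.rand(4 + 8 + 16)
prompt = hierarchical_prompt(feats, w)
print(prompt.shape)  # (32, 32)
```

The resulting dense map could then play the role of the mask prompt that SAM's decoder would otherwise expect from user interaction.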
Extensive experiments on benchmark datasets show that ORSI-SAM achieves superior performance compared to recent state-of-the-art methods with 12.2 M parameters and 8.9 G FLOPs.
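The semantic-aware refinement idea, i.e., using deep semantic features to gate shallow detail features, can be sketched as follows. This is a hedged toy illustration under assumed shapes and a simple sigmoid gate; the names `sard_fuse` and the gating formula are inventions for exposition, not the paper's SARD.

```python
import numpy as np

def sard_fuse(low, high):
    # Sketch of semantic-aware refinement: a high-level semantic map
    # gates low-level detail to suppress shallow-layer interference.
    factor = low.shape[1] // high.shape[1]
    high_up = high.repeat(factor, axis=1).repeat(factor, axis=2)  # align scales
    # Collapse channels of the deep features into one semantic gate in (0, 1).
    gate = 1.0 / (1.0 + np.exp(-high_up.mean(axis=0, keepdims=True)))
    return low * gate + high_up  # gated detail plus semantic context

low = np.random.rand(8, 32, 32)   # shallow, detail-rich features (hypothetical)
high = np.random.rand(8, 8, 8)    # deep, semantic features (hypothetical)
fused = sard_fuse(low, high)
print(fused.shape)  # (8, 32, 32)
```

The design intuition matches the abstract: fine structure survives where the semantic gate is high, while background responses in shallow features are attenuated.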

Zhang Jiehua, Liu Li, Su Zhuo, Liu Tianpeng, Liu Zhen, Pietikäinen Matti

A1 Journal article (refereed), original research


J. Zhang, L. Liu, Z. Su, T. Liu, Z. Liu and M. Pietikäinen, "Advancing Segment Anything Model for Efficient Salient Object Detection in Remote Sensing Images," in IEEE Transactions on Geoscience and Remote Sensing, vol. 63, pp. 1-16, 2025, Art no. 5642216, doi: 10.1109/TGRS.2025.3604856. Keywords: remote sensing; decoding; training; object detection; image segmentation; generators; training data; semantics; interference; computational modeling; lightweight salient object detection (SOD); optical remote sensing image; segment anything model (SAM).

https://doi.org/10.1109/TGRS.2025.3604856
