Sonar image target detection using sample generation and Swin Transformer-YOLO network
Graphical Abstract
Abstract
Due to the high cost of data collection and limited experimental conditions, sonar images are often scarce and of poor quality, which hinders effective feature learning and limits the performance of existing detection methods. This paper proposes an improved YOLO model, Swin Transformer-cascaded group attention YOLO (STC-YOLO), for sonar image target detection, which integrates diffusion-based sample generation with a Swin Transformer and a cascaded group attention (CGA) mechanism. First, Stable Diffusion is fine-tuned via LoRA, and semantic features from the bootstrapping language-image pre-training (BLIP) text model are incorporated to generate high-quality and diverse sonar images for dataset expansion. Then, the Swin Transformer is introduced into the YOLOv8 backbone to enhance multi-scale feature extraction for small targets, and the CGA mechanism is integrated into the C2f module to improve the perception of small objects. Additionally, the skewed intersection-over-union (SIoU) loss function is adopted to better handle the complexities of underwater environments. Experimental results show that the trained generative model produces diverse and realistic samples even in data-scarce scenarios. Compared with the original YOLOv8 model, STC-YOLO achieves a 5% increase in detection accuracy and a 12.6% improvement in mean average precision, enabling high-precision detection of small underwater targets.
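To make the sample-generation stage concrete, the sketch below illustrates one way such a pipeline could be wired together: BLIP captions the available real sonar images, and the captions serve as prompts for a Stable Diffusion model whose UNet carries LoRA adapters. This is a minimal illustration, not the authors' implementation; the model checkpoints, LoRA rank, target modules, prompt wording, and file names are all assumptions, and the fine-tuning loop itself is omitted.

```python
# Minimal sketch (not the authors' code): caption scarce sonar images with BLIP,
# then sample new images from a LoRA-adapted Stable Diffusion pipeline.
# Checkpoints, LoRA hyperparameters, and file names are illustrative assumptions.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from diffusers import StableDiffusionPipeline
from peft import LoraConfig, get_peft_model

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) BLIP produces a text description for each real sonar image.
blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base").to(device)

def caption(image_path: str) -> str:
    image = Image.open(image_path).convert("RGB")
    inputs = blip_proc(image, return_tensors="pt").to(device)
    out = blip.generate(**inputs, max_new_tokens=30)
    return blip_proc.decode(out[0], skip_special_tokens=True)

# 2) Attach low-rank (LoRA) adapters to the Stable Diffusion UNet; only these
#    small adapter weights would be updated when fine-tuning on sonar data.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5").to(device)
lora_cfg = LoraConfig(r=8, lora_alpha=16,
                      target_modules=["to_q", "to_k", "to_v", "to_out.0"])
pipe.unet = get_peft_model(pipe.unet, lora_cfg)  # fine-tuning loop omitted

# 3) After fine-tuning, generate synthetic samples from BLIP-derived prompts.
prompt = caption("real_sonar_example.png") + ", side-scan sonar image"
synthetic = pipe(prompt, num_inference_steps=30).images[0]
synthetic.save("generated_sonar_sample.png")
```

In this sketch, tying the prompts to BLIP captions of real data is what keeps the generated samples semantically close to the original sonar domain while LoRA keeps the number of trainable parameters small.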