FusionKD: Fusion knowledge distillation of vision-language foundation model for strip steel surface defect detection
Accurate surface defect detection in strip steel is vital for industrial quality control, yet existing methods struggle with real-time performance and generalization. While vision-language foundation models (VLMs) offer superior recognition accuracy, their computational cost hinders deployment. This paper presents FusionKD, a novel knowledge distillation framework that transfers rich multimodal knowledge from a frozen large-scale VLM teacher to a highly efficient, vision-only student detector. To bridge the architectural gap between teacher and student, FusionKD introduces three key technical contributions: first, a cross-modal fusion distillation module that establishes bidirectional alignment between visual features and linguistic embeddings, enabling the student to assimilate semantic knowledge without requiring text input at inference; second, a cross-head word-region alignment mechanism that teaches the student fine-grained spatial-semantic associations akin to the teacher's reasoning; and third, a fused knowledge distillation loss formulated around the Pearson correlation coefficient, which optimizes feature correlation rather than exact feature magnitude to promote stable training. To further counter the optimization instability caused by the teacher-student capacity gap, we propose a Dynamic Knowledge Coordination (DKC) framework that stabilizes training through phase-adaptive scheduling, gradient conflict resolution, and adaptive temperature annealing. Extensive experiments on the NEU-DET dataset show that FusionKD achieves 77.8–80.0 mAP with a 3.8× speedup and a 5.3× parameter reduction over the teacher model, while keeping accuracy degradation within 2.6%. The integrated DKC framework provides consistent performance gains, validating its efficacy in mitigating optimization instability caused by capacity disparity. Cross-dataset validation on the PCB and GC10-DET datasets further confirms its superior generalization capability.
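The Pearson-correlation-based distillation loss mentioned above can be illustrated with a minimal sketch. This is an assumption about the general form of such a loss (the paper's exact formulation may differ): for each sample, the loss is one minus the Pearson correlation between the flattened student and teacher feature vectors, averaged over the batch. Because Pearson correlation is invariant to per-sample shifts and positive scalings of the features, this objective matches feature *patterns* rather than raw magnitudes, which is the stated source of training stability.

```python
import numpy as np

def pearson_distill_loss(student_feats: np.ndarray,
                         teacher_feats: np.ndarray,
                         eps: float = 1e-8) -> float:
    """Sketch of a Pearson-correlation distillation loss.

    student_feats, teacher_feats: (batch, dim) flattened feature maps.
    Returns mean over the batch of (1 - Pearson correlation), so the
    loss is 0 when features are perfectly linearly correlated and
    reaches 2 when they are perfectly anti-correlated.
    """
    # Center each sample's feature vector (removes per-sample mean shift).
    s = student_feats - student_feats.mean(axis=1, keepdims=True)
    t = teacher_feats - teacher_feats.mean(axis=1, keepdims=True)
    # Per-sample Pearson correlation = normalized dot product of
    # the centered vectors; eps guards against zero-variance features.
    corr = (s * t).sum(axis=1) / (
        np.sqrt((s ** 2).sum(axis=1)) * np.sqrt((t ** 2).sum(axis=1)) + eps
    )
    return float(np.mean(1.0 - corr))
```

Note the scale/shift invariance: a teacher feature map and a student copy of it that is rescaled and offset (e.g. `2 * f + 3`) yield zero loss, whereas an L2 feature-matching loss would penalize the magnitude mismatch.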