Abstract: Large vision-language models (VLMs) have made significant progress in response generation. However, these models also face more potential security risks due to diverse media ...