
Do I need to finetune my VLM?
Vision-Language Models (VLMs) represent a fascinating intersection of computer vision and natural language processing. The combination of a vision encoder with an LLM has sparked interest in the computer vision community in using these models for zero-shot tasks where traditional methods fall short. Although expensive at high data throughputs, the image + prompt input makes VLMs versatile tools for visual question answering, captioning, and various other tasks.

SmolVLM architecture (Image from https://huggingface....
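To make the image + prompt interface concrete, here is a minimal inference sketch using the transformers library with the SmolVLM-Instruct checkpoint; the image path and question are placeholders, and any similar chat-style VLM checkpoint should work the same way.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Assumed checkpoint for illustration.
model_id = "HuggingFaceTB/SmolVLM-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to("cuda")

# Placeholder image and question.
image = Image.open("example.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]

# Build the chat prompt, combine it with the image, and generate an answer.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

The same image + prompt pattern covers visual question answering, captioning, and other zero-shot uses; fine-tuning only becomes relevant when this off-the-shelf behavior is not good enough for your domain.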