Do I need to finetune my VLM?

Vision-Language Models (VLMs) represent a fascinating intersection of computer vision and natural language processing. Combining a vision encoder with an LLM has sparked interest in the computer vision field in using VLMs for zero-shot tasks, where traditional methods fall short. Although expensive at high data throughput, the image + prompt input makes VLMs versatile tools for visual question answering, captioning, and various other tasks. SmolVLM architecture (Image from https://huggingface....

August 24, 2025

Starting my own Blog

Welcome to my blog. I have spent a lot of time with the latest and greatest lately, and I want you to learn from that experience: Vision-Language Modelling research (strengths, bias, training); VLMs vs. Computer Vision; saving not the $$$ but our planet …

December 19, 2024