Hugging Face, a repository for machine learning models, datasets, and AI tools, has released an open-source vision language model that is lightweight and built for efficiency and speed. Vision language models (VLMs) can understand both text and visual input.
The model, SmolVLM, is available for commercial use with open training pipelines, which means the datasets, code, and methods used to train it are available to the public. Hugging Face offers three variants: SmolVLM-Base, SmolVLM-Synthetic, and SmolVLM-Instruct.
SmolVLM-Base is designed for downstream fine-tuning, meaning it can be adapted and trained for specific tasks. SmolVLM-Synthetic is trained on artificial rather than real-world datasets, and SmolVLM-Instruct can be "used out of the box for interactive end-user applications."
Hugging Face says SmolVLM requires just 5.7GB of GPU RAM, making it smaller and more efficient than competitors such as PaliGemma 3B, InternVL2 2B, and Qwen2-VL-2B. This allows it to run on laptops with limited VRAM.
It is also more token-efficient than comparable models. Tokens are the units of data a model processes, so fewer tokens per image means faster and cheaper inference. SmolVLM can encode a 384x384 image in 81 tokens, whereas Qwen2-VL uses 16k tokens for the same input. The model also requires less compute and memory to run.
Hugging Face is hosting a demo built on SmolVLM-Instruct, along with a supervised fine-tuning script, for anyone to try out.
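For readers who want to experiment locally rather than through the hosted demo, the sketch below shows one way to query an instruct-tuned VLM with the Hugging Face transformers library. The hub repository id "HuggingFaceTB/SmolVLM-Instruct", the image filename, and the prompt text are assumptions for illustration, not details taken from this article.

```python
# Minimal sketch: asking SmolVLM-Instruct a question about a local image.
# Assumption: the model is published under "HuggingFaceTB/SmolVLM-Instruct"
# and an image file "example.jpg" exists in the working directory.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-Instruct")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    torch_dtype=torch.bfloat16 if device == "cuda" else torch.float32,
).to(device)

# Build a chat-style prompt containing one image placeholder and one question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

image = Image.open("example.jpg")
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(device)

# Generate and decode the model's answer.
generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

On a machine without a GPU the same script runs on CPU, just more slowly, which fits the model's low-memory positioning.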