Meta's DINOv2 is a foundation model for computer vision. The company demonstrates the model's strengths and plans to combine DINOv2 with large language models.
In May 2021, AI researchers at Meta presented DINO (self-distillation with no labels), a self-supervised AI model for image tasks such as classification and segmentation. With DINOv2, Meta is now releasing a significantly improved version.
Like DINO, DINOv2 is a computer vision model trained using self-supervised learning, and according to Meta, it performs as well as or better than most of today’s specialized systems on all benchmarked tasks. Because training is self-supervised, no labeled data is required, so the DINO models can be trained on large unlabeled image datasets.
DINOv2 is a building block for all computer vision tasks
“DINOv2 provides high-performance features that can be directly used as inputs for simple linear classifiers,” says Meta. This makes DINOv2 a flexible and versatile tool for a wide range of computer vision tasks, from image-level visual tasks (image classification, instance retrieval, video understanding) to pixel-level visual tasks (depth estimation, semantic segmentation).
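The idea of feeding frozen features into a simple linear classifier can be sketched in a few lines. The example below is only an illustration under stated assumptions: it does not load DINOv2, and the hypothetical helper make_features generates synthetic vectors that stand in for the embeddings a frozen backbone would produce. Only the linear classifier on top is trained, which is the setup usually called a linear probe.

```python
import math
import random


def make_features(n_per_class, dim, rng):
    """Hypothetical stand-in for frozen backbone embeddings.

    Assumption: each class forms a Gaussian cluster in feature space.
    A real pipeline would compute these vectors from images instead.
    """
    data = []
    for label, center in ((0, -1.0), (1, 1.0)):
        for _ in range(n_per_class):
            vector = [rng.gauss(center, 1.0) for _ in range(dim)]
            data.append((vector, label))
    rng.shuffle(data)
    return data


def predict_proba(weights, bias, x):
    """Logistic output of a linear model: sigmoid(w . x + b)."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_probe(data, dim, epochs=20, lr=0.1):
    """Fit only the linear layer with plain SGD; the features stay fixed."""
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            err = predict_proba(weights, bias, x) - y  # log-loss gradient w.r.t. z
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def accuracy(weights, bias, data):
    """Fraction of examples whose thresholded prediction matches the label."""
    hits = sum((predict_proba(weights, bias, x) >= 0.5) == bool(y) for x, y in data)
    return hits / len(data)


rng = random.Random(0)
train_set = make_features(100, 8, rng)
test_set = make_features(50, 8, rng)
weights, bias = train_linear_probe(train_set, dim=8)
test_accuracy = accuracy(weights, bias, test_set)
print(f"held-out accuracy: {test_accuracy:.2f}")
```

Because the backbone is never updated in this setup, the same features could in principle be reused for several downstream tasks, each with its own small classifier on top.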
According to Meta, DINOv2 models could be useful for a variety of applications, including forest mapping with the World Resources Institute, estimating animal density and abundance, and biological research such as cell microscopy.
For training, Meta collected 1.2 billion images, then filtered and balanced the dataset. In the end, DINOv2 was trained on 142 million images. Like its predecessor, DINOv2 relies on Vision Transformers.
Meta wants to link DINOv2 with large language models
DINOv2 complements Meta’s work in computer vision, in particular “Segment Anything”, a recently released model for zero-shot image segmentation with prompt capabilities. Meta sees DINOv2 as a building block that can be linked to other classifiers for use in many areas beyond segmentation.
The company is releasing the code and some models of the DINOv2 family. It now plans to integrate DINOv2 into a more complex AI system that can interact with large language models. “A visual backbone providing rich information on images will allow complex AI systems to reason on images in a deeper way than describing them with a single text sentence,” says Meta.
Models like CLIP, which are trained on image-text pairs, are ultimately limited by their captions, according to Meta. “With DINOv2, there is no such built-in limitation.”