SimDINO: Simplifying DINO via Coding Rate Regularization
We introduce SimDINO and SimDINOv2 by simplifying widely used self-supervised learning (SSL) algorithms (i.e., DINO and DINOv2) via coding rate regularization. These simplified algorithms lead to several key benefits:
Overall, making design choices explicit and principled leads to simplicity and improved performance in vision SSL.
Our experiments demonstrate that SimDINO and SimDINOv2 achieve superior performance compared to their original counterparts:
Method | Model | Epochs | k-NN | Linear |
---|---|---|---|---|
DINO | ViT-B/16 | 100 | 72.9 | 76.3 |
SimDINO | ViT-B/16 | 100 | 74.9 | 77.3 |
DINOv2 | ViT-B/16 | 100 | 76.0 | 77.2 |
SimDINOv2 | ViT-B/16 | 100 | 78.1 | 79.7 |
DINOv2 | ViT-L/16 | 100 | 80.8 | 82.0 |
SimDINOv2 | ViT-L/16 | 100 | 81.1 | 82.4 |
Method | Model | AP50(↑) | AP75(↑) | AP(↑) |
---|---|---|---|---|
SimDINO | ViT-L/16 | 5.4 | 1.9 | 2.4 |
SimDINO | ViT-B/16 | 5.2 | 2.0 | 2.5 |
DINO | ViT-B/16 | 3.9 | 1.5 | 1.8 |
DINO | ViT-B/8 | 5.1 | 2.3 | 2.5 |
Method | Model | AP50(↑) | AP75(↑) | AP(↑) |
---|---|---|---|---|
SimDINO | ViT-L/16 | 4.5 | 1.4 | 1.9 |
SimDINO | ViT-B/16 | 4.7 | 1.5 | 2.0 |
DINO | ViT-B/16 | 3.1 | 1.0 | 1.4 |
DINO | ViT-B/8 | 4.1 | 1.3 | 1.8 |
Method | Model | mIoU(↑) | mAcc(↑) |
---|---|---|---|
DINO | ViT-B/16 | 33.1 | 41.9 |
SimDINO | ViT-B/16 | 33.7 | 42.8 |
DINOv2 | ViT-B/16 | 32.5 | 41.4 |
SimDINOv2 | ViT-B/16 | 36.9 | 46.5 |
DINOv2 | ViT-L/16 | 41.0 | 50.8 |
SimDINOv2 | ViT-L/16 | 41.8 | 52.2 |
Method | Model | (J&F)m(↑) | Jm(↑) | Fm(↑) |
---|---|---|---|---|
DINO | ViT-B/16 | 63.0 | 61.5 | 64.4 |
SimDINO | ViT-B/16 | 63.0 | 61.6 | 64.4 |
DINOv2 | ViT-B/16 | 53.2 | 52.7 | 53.7 |
SimDINOv2 | ViT-B/16 | 60.9 | 60.4 | 61.4 |
DINOv2 | ViT-L/16 | 62.0 | 61.7 | 62.3 |
SimDINOv2 | ViT-L/16 | 62.6 | 61.9 | 63.3 |
Our models exhibit strong emergent properties similar to the original DINO families. Below we show different types of visualizations demonstrating the models' capabilities:
Average self-attention maps showing object-centric attention patterns
Top three principal components of key features in RGB format
Regions attended by the [CLS] token
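The principal-component visualization listed above can be approximated with a few lines of NumPy. This is a minimal sketch, not the authors' plotting code: the function name `pca_rgb`, the `(num_patches, dim)` input layout, and the per-channel min-max scaling are all assumptions made for illustration, and extracting the key features from a ViT is left out.

```python
import numpy as np

def pca_rgb(features, grid_hw):
    """Map per-patch features to an RGB image via their top three principal components.

    features: array of shape (num_patches, dim), e.g. key features of one image (assumed layout).
    grid_hw:  (height, width) of the patch grid, with height * width == num_patches.
    """
    h, w = grid_hw
    # Center the features so that SVD yields principal directions.
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:3].T  # (num_patches, 3)
    # Rescale each component to [0, 1] so it can be shown as a color channel.
    lo = proj.min(axis=0, keepdims=True)
    hi = proj.max(axis=0, keepdims=True)
    rgb = (proj - lo) / np.maximum(hi - lo, 1e-12)
    return rgb.reshape(h, w, 3)
```

For a ViT-B/16 on a 224x224 input, the patch grid would be 14x14, so `pca_rgb(features, (14, 14))` returns an array that can be passed directly to an image viewer.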
We visualize the training dynamics of ViT-B/16 under SimDINO and DINO. The x-axis denotes training epochs and the y-axis indicates k-NN performance on ImageNet-1K.
The core insight of our work is that representation collapse can be prevented directly through coding rate regularization, eliminating the need for many empirically-designed components. Take DINO as an example:
It turns out we can remove many empirically-designed components and replace them with a simple coding rate term in the loss function.
Removing Empirically-Designed Components:
Opting for Simple, Explicit Solutions:
This leads to a much simpler pipeline we term SimDINO:
This streamlined design makes the models more robust to architectural choices and hyperparameter settings while achieving better downstream performance. The coding rate term provides a theoretically motivated way to prevent collapse, replacing many empirical design choices in the original DINO pipeline. Similar simplifications can be made to DINOv2, leading to SimDINOv2. Refer to the paper for more details.
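To make the role of the coding rate term concrete, here is a minimal NumPy sketch of an objective in this spirit. It assumes the coding rate takes the form 0.5 * logdet(I + (d / eps^2) * Cov(Z)) and that the alignment term is a squared Euclidean distance between normalized student and teacher features. The function names, the default `eps` and `gamma` values, and the use of an uncentered covariance estimate are illustrative assumptions, not the released implementation; the paper gives the exact formulation.

```python
import numpy as np

def coding_rate(z, eps=0.5):
    """Coding rate 0.5 * logdet(I + d / eps^2 * Cov) of features z with shape (n, d).

    The covariance is estimated as z^T z / n (uncentered), an assumption of this sketch.
    """
    n, d = z.shape
    cov = z.T @ z / n
    # slogdet is numerically safer than log(det(...)); the matrix is positive definite.
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / eps**2) * cov)
    return 0.5 * logdet

def simdino_style_loss(student, teacher, gamma=1.0, eps=0.5):
    """Alignment term minus a coding rate regularizer (hypothetical helper).

    student, teacher: arrays of shape (n, d) holding features of two views.
    """
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    # Pull the student features toward the teacher features.
    align = 0.5 * np.mean(np.sum((s - t) ** 2, axis=1))
    # Reward features that spread out; a collapsed batch has a low coding rate.
    return align - gamma * coding_rate(s, eps)
```

If every feature in the batch collapses to the same vector, the covariance has rank one and the coding rate is small, so the loss is higher than for a batch whose features spread over many directions. This is how the regularizer discourages collapse without extra machinery in this sketch.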
@article{wu2025simdino,
title={Simplifying DINO via Coding Rate Regularization},
author={Ziyang Wu and Jingyuan Zhang and Druv Pai and Xudong Wang and Chandan Singh and Jianwei Yang and Jianfeng Gao and Yi Ma},
journal={arXiv preprint arXiv:2502.10385},
year={2025}
}