SimDINO: Simplifying DINO via Coding Rate Regularization

¹UC Berkeley  ²TranscEngram  ³Microsoft Research  ⁴HKU

Overview

We introduce SimDINO and SimDINOv2 by simplifying widely used SSL algorithms (i.e., DINO and DINOv2) via coding rate regularization. These simplified algorithms lead to several key benefits:

  • Simplicity: Removing many empirically-selected components from the original DINO pipeline.
  • Robustness: More robust to variations in architecture, datasets, and other configurations.
  • Performance: Leading to better learned representations and performance on downstream tasks.
  • Interpretability: Providing a theoretically motivated way to prevent representation collapse.

Overall, making design choices explicit and principled leads to simplicity and improved performance in vision SSL.

Main Results

Our experiments demonstrate that SimDINO and SimDINOv2 achieve superior performance compared to their original counterparts:

Quantitative Results

ImageNet Classification (ImageNet-1K)

Method      Model     Epochs  k-NN  Linear
DINO        ViT-B/16  100     72.9  76.3
SimDINO     ViT-B/16  100     74.9  77.3
DINOv2      ViT-B/16  100     76.0  77.2
SimDINOv2   ViT-B/16  100     78.1  79.7
DINOv2      ViT-L/16  100     80.8  82.0
SimDINOv2   ViT-L/16  100     81.1  82.4

Unsupervised Object Detection (COCO val2017)

Method    Model     AP50 (↑)  AP75 (↑)  AP (↑)
SimDINO   ViT-L/16  5.4       1.9       2.4
SimDINO   ViT-B/16  5.2       2.0       2.5
DINO      ViT-B/16  3.9       1.5       1.8
DINO      ViT-B/8   5.1       2.3       2.5

Unsupervised Instance Segmentation (COCO val2017)

Method    Model     AP50 (↑)  AP75 (↑)  AP (↑)
SimDINO   ViT-L/16  4.5       1.4       1.9
SimDINO   ViT-B/16  4.7       1.5       2.0
DINO      ViT-B/16  3.1       1.0       1.4
DINO      ViT-B/8   4.1       1.3       1.8

Semantic Segmentation (ADE20K)

Method      Model     mIoU (↑)  mAcc (↑)
DINO        ViT-B/16  33.1      41.9
SimDINO     ViT-B/16  33.7      42.8
DINOv2      ViT-B/16  32.5      41.4
SimDINOv2   ViT-B/16  36.9      46.5
DINOv2      ViT-L/16  41.0      50.8
SimDINOv2   ViT-L/16  41.8      52.2

Video Object Segmentation (DAVIS-2017)

Method      Model     (J&F)m (↑)  Jm (↑)  Fm (↑)
DINO        ViT-B/16  63.0        61.5    64.4
SimDINO     ViT-B/16  63.0        61.6    64.4
DINOv2      ViT-B/16  53.2        52.7    53.7
SimDINOv2   ViT-B/16  60.9        60.4    61.4
DINOv2      ViT-L/16  62.0        61.7    62.3
SimDINOv2   ViT-L/16  62.6        61.9    63.3

Feature Visualizations

Our models exhibit strong emergent properties similar to the original DINO and DINOv2 models. Below we show different types of visualizations demonstrating the models' capabilities:

Attention Maps

Average self-attention maps showing object-centric attention patterns

PCA Visualization

Top three principal components of key features in RGB format

Saliency Maps

Regions attended by the [CLS] token

Optimization Dynamics

We visualize the training dynamics of ViT-B/16 trained with SimDINO and with DINO. The x-axis denotes training epochs and the y-axis indicates k-NN accuracy on ImageNet-1K (a simplified sketch of this k-NN evaluation follows the list below).

  • Left: Both models are trained on ImageNet-1K. We omit the earlier epochs for better visualization.
  • Right: Both models are trained on COCO train2017 (roughly 1/10th the size of ImageNet-1K). This verification experiment shows that SimDINO requires substantially less tuning and is much easier to optimize.
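For reference, here is a simplified sketch of the k-NN evaluation plotted on the y-axis, assuming frozen backbone features, cosine similarity, and a majority vote over the k nearest training samples; the standard DINO protocol additionally uses similarity-weighted voting, so treat this as an illustrative approximation rather than the exact evaluation code.

import torch
import torch.nn.functional as F

def knn_accuracy(train_feats, train_labels, test_feats, test_labels, k=20):
    # Features come from the frozen backbone; labels are integer class indices.
    train_feats = F.normalize(train_feats, dim=1)
    test_feats = F.normalize(test_feats, dim=1)
    sims = test_feats @ train_feats.T               # cosine similarities
    _, idx = sims.topk(k, dim=1)                    # indices of k nearest training samples
    neighbor_labels = train_labels[idx]             # (num_test, k)
    preds = neighbor_labels.mode(dim=1).values      # majority vote over neighbors
    return (preds == test_labels).float().mean().item()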

Approach

The core insight of our work is that representation collapse can be prevented directly through coding rate regularization, eliminating the need for many empirically-designed components. Take DINO as an example:

It turns out we can remove many empirically-designed components and replace them with a simple coding rate term in the loss function.

Removing Empirically-Designed Components:

  • A weight-normalized linear layer that projects features to high-dimensional (~65,536-dim) outputs for clustering.
  • Balancing operations to prevent representation collapse (e.g., centering, sharpening).
  • Miscellaneous hyperparameters (e.g., temperature schedules, centering momentum). A schematic sketch of this removed machinery follows this list.
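To make concrete what is being removed, below is a schematic PyTorch sketch of the DINO-style centering and sharpening applied to the teacher's high-dimensional prototype outputs. The hyperparameter values are placeholders for illustration, not the official settings.

import torch
import torch.nn.functional as F

NUM_PROTOTYPES = 65_536       # high-dimensional clustering head output
CENTER_MOMENTUM = 0.9         # momentum for the running center (placeholder value)
TEACHER_TEMP = 0.04           # sharpening temperature (scheduled in DINO; placeholder)

center = torch.zeros(NUM_PROTOTYPES)   # running center of teacher outputs

def teacher_targets(teacher_logits):
    # Centering: subtract a running mean so no single dimension dominates.
    # Sharpening: low-temperature softmax to discourage the uniform solution.
    global center
    probs = F.softmax((teacher_logits - center) / TEACHER_TEMP, dim=-1)
    # Momentum update of the center from the current batch of teacher logits.
    center = CENTER_MOMENTUM * center + (1 - CENTER_MOMENTUM) * teacher_logits.mean(dim=0)
    return probs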

Opting for Simple, Explicit Solutions:

  • Feature alignment loss $\mathcal{L}_{\text{align}}$ via a simple Euclidean distance: $d_{\ell^2}(z_1, z_2) = \frac{1}{2}\|z_1 - z_2\|^2$.
  • Anti-collapse loss $\mathcal{L}_{\text{rate}}$ based on explicit coding rate regularization: $R_{\epsilon}(\Gamma) := \frac{1}{2}\log\det\!\left(I + \frac{d}{\epsilon^{2}}\Gamma\right)$, where $\Gamma = \mathrm{Cov}(Z)$ is the covariance of global-view features computed over a batch of images, $d$ is the feature dimension, and $\epsilon$ denotes the distortion error (refer to the paper for specifics).
  • Combining these two losses gives the complete objective $\mathcal{L} = \mathcal{L}_{\text{align}} + \lambda \mathcal{L}_{\text{rate}}$, where the balancing factor $\lambda$ can be derived explicitly from gradient norms (see Appendix C). A minimal code sketch of these two terms follows this list.
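For concreteness, here is a minimal PyTorch sketch of the two loss terms, assuming a batch of global-view features of shape (batch, d). The function names (coding_rate, simdino_loss) and the convention of minimizing the negative coding rate are our own illustration, not the official implementation.

import torch

def coding_rate(Z, eps=0.5):
    # R_eps(Cov(Z)) = 1/2 * logdet(I + d / eps^2 * Cov(Z)).
    # Z: (batch, d) global-view features; eps is the distortion parameter.
    b, d = Z.shape
    Zc = Z - Z.mean(dim=0, keepdim=True)               # center the features
    cov = Zc.T @ Zc / b                                # empirical covariance, (d, d)
    I = torch.eye(d, device=Z.device, dtype=Z.dtype)
    return 0.5 * torch.logdet(I + (d / eps ** 2) * cov)

def simdino_loss(z_student, z_teacher, lam=1.0, eps=0.5):
    # L = L_align + lambda * L_rate.
    # L_align: squared Euclidean distance between paired student/teacher features.
    # L_rate: taken here as the *negative* coding rate of the teacher's global-view
    # features, so minimizing the total loss maximizes R and prevents collapse
    # (sign convention assumed for illustration; see the paper for the exact objective).
    l_align = 0.5 * (z_student - z_teacher).pow(2).sum(dim=-1).mean()
    l_rate = -coding_rate(z_teacher, eps)
    return l_align + lam * l_rate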

This leads to a much simpler pipeline, which we term SimDINO.

This streamlined design makes the models more robust to architectural choices and hyperparameter settings while achieving better downstream performance. The coding rate term provides a theoretically motivated way to prevent collapse, replacing many empirical design choices in the original DINO pipeline. Similar simplifications can be made to DINOv2, leading to SimDINOv2. Refer to the paper for more details.
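As a rough sketch of how the pieces fit together during training, the snippet below assumes the usual DINO-style student/teacher setup with an EMA teacher and two global crops per image; the names train_step and ema_update are ours, simdino_loss refers to the sketch above, and the real pipeline (multi-crop, schedules, distributed training) has more moving parts.

import torch

@torch.no_grad()
def ema_update(teacher, student, m=0.996):
    # Teacher weights track an exponential moving average of the student.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(m).add_(p_s.detach(), alpha=1 - m)

def train_step(student, teacher, optimizer, crop_a, crop_b, lam=1.0):
    z_student = student(crop_a)               # student encodes one global view
    with torch.no_grad():
        z_teacher = teacher(crop_b)           # teacher encodes the other view
    loss = simdino_loss(z_student, z_teacher, lam=lam)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    ema_update(teacher, student)              # update teacher after the student step
    return loss.item()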

Citation

@article{wu2025simdino,
  title={Simplifying DINO via Coding Rate Regularization},
  author={Wu, Ziyang and Zhang, Jingyuan and Pai, Druv and Wang, Xudong and Singh, Chandan and Yang, Jianwei and Gao, Jianfeng and Ma, Yi},
  journal={arXiv preprint arXiv:2502.10385},
  year={2025}
}