SimDINO: Simplifying DINO via Coding Rate Regularization

¹UC Berkeley  ²TranscEngram  ³Microsoft Research  ⁴HKU

Overview

We introduce SimDINO and SimDINOv2 by simplifying widely used SSL algorithms (i.e., DINO and DINOv2) via coding rate regularization. These simplified algorithms lead to several key benefits:

  • Simplicity: Removing many empirically-selected components from the original DINO pipeline.
  • Robustness: More robust to variations in architecture, datasets, and other configurations.
  • Performance: Leading to better learned representations and performance on downstream tasks.
  • Interpretability: Providing a theoretically motivated way to prevent representation collapse.

Overall, making design choices explicit and principled leads to simplicity and improved performance in vision SSL.

Main Results

Our experiments demonstrate that SimDINO and SimDINOv2 achieve superior performance compared to their original counterparts:

Quantitative Results

ImageNet Classification (ImageNet-1K)

Method      Model     Epochs  k-NN  Linear
DINO        ViT-B/16  100     72.9  76.3
SimDINO     ViT-B/16  100     74.9  77.3
DINOv2      ViT-B/16  100     76.0  77.2
SimDINOv2   ViT-B/16  100     78.1  79.7
DINOv2      ViT-L/16  100     80.8  82.0
SimDINOv2   ViT-L/16  100     81.1  82.4

Unsupervised Object Detection (COCO val2017)

Method    Model     AP50 (↑)  AP75 (↑)  AP (↑)
SimDINO   ViT-L/16  5.4       1.9       2.4
SimDINO   ViT-B/16  5.2       2.0       2.5
DINO      ViT-B/16  3.9       1.5       1.8
DINO      ViT-B/8   5.1       2.3       2.5

Unsupervised Instance Segmentation (COCO val2017)

Method    Model     AP50 (↑)  AP75 (↑)  AP (↑)
SimDINO   ViT-L/16  4.5       1.4       1.9
SimDINO   ViT-B/16  4.7       1.5       2.0
DINO      ViT-B/16  3.1       1.0       1.4
DINO      ViT-B/8   4.1       1.3       1.8

Semantic Segmentation (ADE20K)

Method      Model     mIoU (↑)  mAcc (↑)
DINO        ViT-B/16  33.1      41.9
SimDINO     ViT-B/16  33.7      42.8
DINOv2      ViT-B/16  32.5      41.4
SimDINOv2   ViT-B/16  36.9      46.5
DINOv2      ViT-L/16  41.0      50.8
SimDINOv2   ViT-L/16  41.8      52.2

Video Object Segmentation (DAVIS-2017)

Method      Model     (J&F)m (↑)  Jm (↑)  Fm (↑)
DINO        ViT-B/16  63.0        61.5    64.4
SimDINO     ViT-B/16  63.0        61.6    64.4
DINOv2      ViT-B/16  53.2        52.7    53.7
SimDINOv2   ViT-B/16  60.9        60.4    61.4
DINOv2      ViT-L/16  62.0        61.7    62.3
SimDINOv2   ViT-L/16  62.6        61.9    63.3

Feature Visualizations

Our models exhibit strong emergent properties similar to the original DINO and DINOv2 models. Below we show different types of visualizations demonstrating the models' capabilities:

Attention Maps

Average self-attention maps showing object-centric attention patterns

PCA Visualization

Top three principal components of key features in RGB format

Saliency Maps

Regions attended by the [CLS] token

Optimization Dynamics

We visualize the training dynamics of ViT-B/16 trained with SimDINO and with DINO. The x-axis denotes training epochs and the y-axis indicates k-NN accuracy on ImageNet-1K (a simplified sketch of this k-NN evaluation follows the list below).

  • Left: Both models are trained on ImageNet-1K. We omit the earlier epochs for better visualization.
  • Right: Both models are trained on COCO train2017 (roughly 1/10th the size of ImageNet-1K). This verification experiment shows that SimDINO requires substantially less tuning and is much easier to optimize.
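For reference, here is a simplified sketch of the k-NN evaluation plotted on the y-axis, assuming frozen backbone features, cosine similarity, and a majority vote over the k nearest training samples; the standard DINO protocol additionally uses similarity-weighted voting, so treat this as an illustrative approximation rather than the exact evaluation code.

import torch
import torch.nn.functional as F

def knn_accuracy(train_feats, train_labels, test_feats, test_labels, k=20):
    # Features come from the frozen backbone; labels are integer class indices.
    train_feats = F.normalize(train_feats, dim=1)
    test_feats = F.normalize(test_feats, dim=1)
    sims = test_feats @ train_feats.T               # cosine similarities
    _, idx = sims.topk(k, dim=1)                    # indices of k nearest training samples
    neighbor_labels = train_labels[idx]             # (num_test, k)
    preds = neighbor_labels.mode(dim=1).values      # majority vote over neighbors
    return (preds == test_labels).float().mean().item()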

Approach

The core insight of our work is that representation collapse can be prevented directly through coding rate regularization, eliminating the need for many empirically-designed components. Take DINO as an example:

It turns out we can remove many empirically-designed components and replace them with a simple coding rate term in the loss function.

Removing Empirically-Designed Components:

  • A weight-normalized linear layer that projects features to high-dimensional (~65,536-dim) outputs for clustering.
  • Balancing operations to prevent representation collapse (e.g., centering, sharpening).
  • Miscellaneous hyperparameters (e.g., temperature schedules, centering momentum). A schematic sketch of this removed machinery follows this list.
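To make concrete what is being removed, below is a schematic PyTorch sketch of the DINO-style centering and sharpening applied to the teacher's high-dimensional prototype outputs. The hyperparameter values are placeholders for illustration, not the official settings.

import torch
import torch.nn.functional as F

NUM_PROTOTYPES = 65_536       # high-dimensional clustering head output
CENTER_MOMENTUM = 0.9         # momentum for the running center (placeholder value)
TEACHER_TEMP = 0.04           # sharpening temperature (scheduled in DINO; placeholder)

center = torch.zeros(NUM_PROTOTYPES)   # running center of teacher outputs

def teacher_targets(teacher_logits):
    # Centering: subtract a running mean so no single dimension dominates.
    # Sharpening: low-temperature softmax to discourage the uniform solution.
    global center
    probs = F.softmax((teacher_logits - center) / TEACHER_TEMP, dim=-1)
    # Momentum update of the center from the current batch of teacher logits.
    center = CENTER_MOMENTUM * center + (1 - CENTER_MOMENTUM) * teacher_logits.mean(dim=0)
    return probs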

Opting for Simple, Explicit Solutions:

  • Feature alignment loss $\mathcal{L}_{\text{align}}$ via a simple Euclidean distance: $d_{\ell^2}(z_1, z_2) = \frac{1}{2}\|z_1 - z_2\|^2$.
  • Anti-collapse loss $\mathcal{L}_{\text{rate}}$ based on explicit coding rate regularization: $R_{\epsilon}(\Gamma) := \frac{1}{2}\log\det\!\left(I + \frac{d}{\epsilon^{2}}\Gamma\right)$, where $\Gamma = \mathrm{Cov}(Z)$ is the covariance of global-view features computed over a batch of images, $d$ is the feature dimension, and $\epsilon$ denotes the distortion error (refer to the paper for specifics).
  • Combining these two losses gives the complete objective $\mathcal{L} = \mathcal{L}_{\text{align}} + \lambda \mathcal{L}_{\text{rate}}$, where the balancing factor $\lambda$ can be derived explicitly from gradient norms (see Appendix C). A minimal code sketch of these two terms follows this list.
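For concreteness, here is a minimal PyTorch sketch of the two loss terms, assuming a batch of global-view features of shape (batch, d). The function names (coding_rate, simdino_loss) and the convention of minimizing the negative coding rate are our own illustration, not the official implementation.

import torch

def coding_rate(Z, eps=0.5):
    # R_eps(Cov(Z)) = 1/2 * logdet(I + d / eps^2 * Cov(Z)).
    # Z: (batch, d) global-view features; eps is the distortion parameter.
    b, d = Z.shape
    Zc = Z - Z.mean(dim=0, keepdim=True)               # center the features
    cov = Zc.T @ Zc / b                                # empirical covariance, (d, d)
    I = torch.eye(d, device=Z.device, dtype=Z.dtype)
    return 0.5 * torch.logdet(I + (d / eps ** 2) * cov)

def simdino_loss(z_student, z_teacher, lam=1.0, eps=0.5):
    # L = L_align + lambda * L_rate.
    # L_align: squared Euclidean distance between paired student/teacher features.
    # L_rate: taken here as the *negative* coding rate of the teacher's global-view
    # features, so minimizing the total loss maximizes R and prevents collapse
    # (sign convention assumed for illustration; see the paper for the exact objective).
    l_align = 0.5 * (z_student - z_teacher).pow(2).sum(dim=-1).mean()
    l_rate = -coding_rate(z_teacher, eps)
    return l_align + lam * l_rate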

This leads to a much simpler pipeline, which we term SimDINO.

This streamlined design makes the models more robust to architectural choices and hyperparameter settings while achieving better downstream performance. The coding rate term provides a theoretically motivated way to prevent collapse, replacing many empirical design choices in the original DINO pipeline. Similar simplifications can be made to DINOv2, leading to SimDINOv2. Refer to the paper for more details.
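As a rough sketch of how the pieces fit together during training, the snippet below assumes the usual DINO-style student/teacher setup with an EMA teacher and two global crops per image; the names train_step and ema_update are ours, simdino_loss refers to the sketch above, and the real pipeline (multi-crop, schedules, distributed training) has more moving parts.

import torch

@torch.no_grad()
def ema_update(teacher, student, m=0.996):
    # Teacher weights track an exponential moving average of the student.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(m).add_(p_s.detach(), alpha=1 - m)

def train_step(student, teacher, optimizer, crop_a, crop_b, lam=1.0):
    z_student = student(crop_a)               # student encodes one global view
    with torch.no_grad():
        z_teacher = teacher(crop_b)           # teacher encodes the other view
    loss = simdino_loss(z_student, z_teacher, lam=lam)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    ema_update(teacher, student)              # update teacher after the student step
    return loss.item()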

Citation

@article{wu2025simdino,
  title={Simplifying DINO via Coding Rate Regularization},
  author={Wu, Ziyang and Zhang, Jingyuan and Pai, Druv and Wang, Xudong and Singh, Chandan and Yang, Jianwei and Gao, Jianfeng and Ma, Yi},
  journal={arXiv preprint arXiv:2502.10385},
  year={2025}
}