Official inference code for the Web-SSL model family introduced in: Scaling Language-Free Visual Representation Learning.
David Fan*, Shengbang Tong*, Jiachen Zhu, Koustuv Sinha, Zhuang Liu, Xinlei Chen, Michael Rabbat, Nicolas Ballas, Yann LeCun, Amir Bar†, Saining Xie†
Meta FAIR, New York University, Princeton University
*equal contribution, †equal advising
Web-SSL explores the scaling potential of visual self-supervised learning (SSL) on web-scale data. By scaling model size and training data, we show that vision-only models can match and even surpass language-supervised methods like CLIP, challenging the prevailing assumption that language supervision is necessary to learn strong visual representations for multimodal modeling. We present Web-SSL: a family of vision-only models, ranging from 0.3B to 7B parameters, that offers a strong alternative to CLIP for both multimodal modeling and classic vision tasks.
Key findings:
- 📈 SSL improves continuously with both model capacity and data.
- 🔍 Web-SSL matches or exceeds language-supervised methods on a wide range of VQA tasks—even on language-related tasks like OCR & Chart understanding, which were traditionally dominated by CLIP.
- 🖼️ Our models maintain competitive performance on classic vision tasks like classification and segmentation while excelling at multimodal tasks.
- 📊 Visual SSL methods are sensitive to data distribution! Training on filtered datasets with a higher concentration of text-rich images substantially improves OCR & Chart understanding.
We provide our model weights in both HuggingFace and native PyTorch formats. Please see the Usage section for sample model loading and inference code.
Web-DINO is a family of DINOv2 models ranging from 0.3B to 7B parameters, trained at larger scale on web images. Web-DINO models especially excel at multimodal tasks such as VQA, without sacrificing performance on classic vision tasks such as image classification. Please see our paper for full details.
Model | Patch Size | Resolution | Data | HuggingFace | Weights |
---|---|---|---|---|---|
webssl-dino300m-full2b-224 | 14x14 | 224x224 | 2B (MC-2B) | Link | Link |
webssl-dino1b-full2b-224 | 14x14 | 224x224 | 2B (MC-2B) | Link | Link |
webssl-dino2b-full2b-224 | 14x14 | 224x224 | 2B (MC-2B) | Link | Link |
webssl-dino3b-full2b-224 | 14x14 | 224x224 | 2B (MC-2B) | Link | Link |
webssl-dino5b-full2b-224 | 14x14 | 224x224 | 2B (MC-2B) | Link | Link |
webssl-dino7b-full8b-224 ⭐ | 14x14 | 224x224 | 8B (MC-2B) | Link | Link |
webssl-dino7b-full8b-378 ⭐ | 14x14 | 378x378 | 8B (MC-2B) | Link | Link |
webssl-dino7b-full8b-518 ⭐ | 14x14 | 518x518 | 8B (MC-2B) | Link | Link |
Model Notes:
- webssl-dino7b-full8b-224 ⭐: Best 224x224 resolution model
- webssl-dino7b-full8b-378 ⭐: Better performance with 378x378 resolution
- webssl-dino7b-full8b-518 ⭐: Best overall performance with 518x518 resolution
These models were trained on filtered subsets of MC-2B images with a higher concentration of text (e.g., signs, charts, tables, and annotations). This enhances OCR & Chart understanding capabilities without a notable performance drop in other VQA categories, relative to same-size models trained on the full data.
Model | Patch Size | Resolution | Data | HuggingFace | Weights |
---|---|---|---|---|---|
webssl-dino2b-light2b-224 | 14x14 | 224x224 | 2B (MC-2B light) | Link | Link |
webssl-dino2b-heavy2b-224 | 14x14 | 224x224 | 2B (MC-2B heavy) | Link | Link |
webssl-dino3b-light2b-224 | 14x14 | 224x224 | 2B (MC-2B light) | Link | Link |
webssl-dino3b-heavy2b-224 | 14x14 | 224x224 | 2B (MC-2B heavy) | Link | Link |
Data Notes:
- MC-2B light: 50.3% subset of MC-2B images that contain text
- MC-2B heavy: 1.3% subset of MC-2B images that contain charts/documents
Web-MAE is a family of MAE models ranging from 0.3B to 3B parameters, trained at larger scale on web images. We release only the encoder, for feature extraction (see the sketch after the table below).
Model | Patch Size | Resolution | Data | HuggingFace | Weights |
---|---|---|---|---|---|
webssl-mae300m-full2b-224 | 16x16 | 224x224 | 2B (MC-2B) | Link | Link |
webssl-mae700m-full2b-224 | 14x14 | 224x224 | 2B (MC-2B) | Link | Link |
webssl-mae1b-full2b-224 | 14x14 | 224x224 | 2B (MC-2B) | Link | Link |
webssl-mae2b-full2b-224 | 14x14 | 224x224 | 2B (MC-2B) | Link | Link |
webssl-mae3b-full2b-224 | 14x14 | 224x224 | 2B (MC-2B) | Link | Link |
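For reference, here is a minimal sketch of extracting encoder features from a Web-MAE checkpoint through HuggingFace. The checkpoint id is assumed to follow the same facebook/ namespace as the Web-DINO models, and AutoModel / AutoImageProcessor are assumed to resolve to the appropriate classes; see demo_webmae.py for the reference implementation.

import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Assumed checkpoint id, following the facebook/ namespace used for the Web-DINO models
model_name = "facebook/webssl-mae1b-full2b-224"
processor = AutoImageProcessor.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).cuda().eval()

image = Image.open("sample_images/bird.JPEG")
inputs = processor(images=image, return_tensors="pt").to('cuda')
with torch.no_grad():
    outputs = model(**inputs)
features = outputs.last_hidden_state  # encoder token features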
In response to community feedback, we are open-sourcing a few add-ons that are not part of the official release. Please see ADDITIONAL_MODELS.md for more details.
Older or newer package versions may work, but we have not tested them with this inference code.
conda create -n webssl python=3.11
conda activate webssl
pip install torch==2.5.1 torchvision==0.20.1 xformers --index-url https://download.pytorch.org/whl/cu124
pip install transformers==4.48.0 huggingface-hub==0.27.1 timm==1.0.15
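As an optional sanity check after installation, the following sketch simply prints the installed versions and confirms that CUDA is visible:

# Optional sanity check: print installed versions and confirm CUDA is visible
import torch
import torchvision
import transformers
import timm

print("torch:", torch.__version__, "| torchvision:", torchvision.__version__)
print("transformers:", transformers.__version__, "| timm:", timm.__version__)
print("CUDA available:", torch.cuda.is_available())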
We provide two examples of using our models: one with HuggingFace and one with native PyTorch. Note that you are not limited to the pretraining resolution for inference; however, you will likely get the best results by running inference at the same resolution.
You may choose to download the model weights locally first using huggingface-cli. This is convenient when you don't have a large cache or when the network is slow.
E.g. `huggingface-cli download facebook/webssl-dino7b-full8b-518 --local-dir YOUR_PATH`, then supply `YOUR_PATH` to `from_pretrained()`.
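For example, assuming the checkpoint was downloaded to YOUR_PATH as above, the local directory is a drop-in replacement for the Hub model name:

from transformers import AutoImageProcessor, Dinov2Model

# YOUR_PATH is the --local-dir used with huggingface-cli above
processor = AutoImageProcessor.from_pretrained("YOUR_PATH")
model = Dinov2Model.from_pretrained("YOUR_PATH")

The full HuggingFace example below loads directly from the Hub by model name: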
import torch
from PIL import Image
from transformers import AutoImageProcessor, Dinov2Model

# Load a Web-DINO model
model_name = "facebook/webssl-dino1b-full2b-224"
processor = AutoImageProcessor.from_pretrained(model_name)
model = Dinov2Model.from_pretrained(model_name, attn_implementation='sdpa')  # 'eager' attention also supported
model.cuda().eval()

# Process an image
image = Image.open("sample_images/bird.JPEG")
inputs = processor(images=image, return_tensors="pt").to('cuda')

with torch.no_grad():
    outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state
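The output token sequence follows the standard DINOv2 layout, with the CLS token first, so the global image feature and the per-patch features can be split apart:

# Standard DINOv2 token layout: index 0 is the CLS token, the rest are patch tokens
cls_token = last_hidden_states[:, 0]      # global image feature, shape (1, hidden_dim)
patch_tokens = last_hidden_states[:, 1:]  # per-patch features, shape (1, num_patches, hidden_dim)

For the 224x224 models with 14x14 patches, num_patches is (224/14)^2 = 256. The same model can also be loaded through the native PyTorch implementation below.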
from dinov2.vision_transformer import webssl_dino1b_full2b_224
import torch
from PIL import Image
from torchvision import transforms

# Define image transformation
transform = transforms.Compose([
    transforms.Resize(256, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

# Load model
model = webssl_dino1b_full2b_224()

# Load weights
checkpoint_path = "path/to/downloaded/weights.pth"
state_dict = torch.load(checkpoint_path, map_location="cpu")
msg = model.load_state_dict(state_dict, strict=False)
print(f"Loaded weights: {msg}")
model.cuda().eval()

# Process an image
image = Image.open("sample_images/bird.JPEG")
x = transform(image).unsqueeze(0).cuda()

with torch.no_grad():
    features = model.forward_features(x)
patch_features = features['x_norm_patchtokens']
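Beyond the patch tokens, the forward_features dictionary also exposes the normalized CLS token. A minimal sketch, assuming the usual DINOv2-style output keys:

# 'x_norm_clstoken' is the normalized CLS token in the standard DINOv2
# forward_features output (assumed key); use it as a global image feature
cls_feature = features['x_norm_clstoken']  # shape (1, embed_dim)
print(cls_feature.shape, patch_features.shape)

To try a different inference resolution, change the Resize and CenterCrop sizes in the transform above; as noted earlier, the pretraining resolution will likely give the best results.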
See demo_webdino.py and demo_webmae.py for a complete example comparing HuggingFace and PyTorch implementations.
If you find this repository useful for your research, please consider citing:
@article{fan2025scaling,
  title={Scaling Language-Free Visual Representation Learning},
  author={Fan, David and Tong, Shengbang and Zhu, Jiachen and Sinha, Koustuv and Liu, Zhuang and Chen, Xinlei and Rabbat, Michael and Ballas, Nicolas and LeCun, Yann and Bar, Amir and others},
  journal={arXiv preprint arXiv:2504.01017},
  year={2025}
}
The majority of Web-SSL is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: DINOv2 is licensed under the Apache 2.0 license.
We thank the DINOv2 and MAE teams for their excellent codebases, and the MetaCLIP team for the wonderful MetaCLIP dataset and codebase. We thank the Cambrian team for their insightful study into the role of vision encoders in MLLMs and their evaluation suite. Lastly, we thank our amazing collaborators in FAIR and the Meta Open-Source team for making this possible.