Training on DataCompDR with OpenCLIP

We provide release code and a patch to OpenCLIP for training models on DataCompDR.

Data

Our reinforcements to DataComp are available on HuggingFace.

Our data does not include the original images and captions. For DataCompDR-12M, there is a corresponding DataComp-12M with the original captions. Download both datasets, then run the following script to join them shard by shard:

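#!/bin/bash
DATACOMP12M_PATH="./datasets/DataComp-12M/" # Download path of DataComp-12M from HF
DATACOMPDR12M_NOIMG_PATH="./datasets/DataCompDR-12M-noimage/" # Download path of DataCompDR-12M from HF
DATACOMPDR12M_PATH="./datasets/DataCompDR-12M/" # Output path for the joined shards
mkdir -p "$DATACOMPDR12M_PATH"
for i in {00000000..00001023}
do
  mkdir tmp
  # Extract the original captions and the reinforcements into one directory
  tar -xf "$DATACOMP12M_PATH/${i}.tar" -C tmp
  tar -xf "$DATACOMPDR12M_NOIMG_PATH/${i}.tar" -C tmp
  # List the files from inside tmp so the archive stores bare member names
  ls tmp | tar -cf "$DATACOMPDR12M_PATH/${i}.tar" -C tmp -T -
  rm -rf tmp
done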
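
After joining, it can be worth checking that a joined shard still contains every file from both source shards. The helper below is a minimal sketch rather than part of the release code: the function names `list_members` and `check_joined_shard` are hypothetical, and it only compares member names, not file contents.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the release code.

# Print the sorted, de-duplicated file member names of the given tar archives.
# Directory entries are dropped and a leading "./" is stripped.
list_members() {
  for archive in "$@"; do
    tar -tf "$archive"
  done | grep -v '/$' | sed 's|^\./||' | LC_ALL=C sort -u
}

# Usage: check_joined_shard SOURCE_A.tar SOURCE_B.tar JOINED.tar
# Succeeds only if every member of both source shards is also in the joined shard.
check_joined_shard() {
  expected="$(mktemp)"
  actual="$(mktemp)"
  list_members "$1" "$2" > "$expected"
  list_members "$3" > "$actual"
  # comm -23 keeps the lines that appear only in the first (expected) list.
  missing="$(LC_ALL=C comm -23 "$expected" "$actual")"
  rm -f "$expected" "$actual"
  if [ -n "$missing" ]; then
    printf 'Missing from %s:\n%s\n' "$3" "$missing" >&2
    return 1
  fi
}
```

For example, `check_joined_shard "$DATACOMP12M_PATH/00000000.tar" "$DATACOMPDR12M_NOIMG_PATH/00000000.tar" "$DATACOMPDR12M_PATH/00000000.tar"` would check the first shard using the paths from the script above.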

The images have to be downloaded separately. See hf_dataset_example.py for an example of downloading a single image.

Installing dependencies

We use OpenCLIP for training. We have made minor modifications to OpenCLIP to support loading reinforcements and the training loss. To check out the specific version of each library and apply our corresponding patch, run the following commands in order:

# Clone MobileCLIP repository
git clone git@github.com:apple/ml-mobileclip.git
cd ml-mobileclip/

# Clone OpenCLIP repository, apply patch, and install
git clone https://github.com/mlfoundations/open_clip.git
cd open_clip
git checkout cf86ee7ec4658845f640858ecd34d0f15588271a
git apply ../open_clip.patch  # Support for sampling without replacement
cp -r ../configs ./
cp -r ../dr ./src/training/
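
The patch targets one specific OpenCLIP commit, so it may help to confirm that the checkout landed on that commit before you train. The function below is a small sketch under that assumption. `verify_commit` is a hypothetical name, and it relies only on `git rev-parse`.

```shell
#!/bin/bash
# Hypothetical helper, not part of the release code.
# Usage: verify_commit REPO_DIR EXPECTED_COMMIT_HASH
# Succeeds only if HEAD of the repository at REPO_DIR is exactly the expected commit.
verify_commit() {
  repo_dir="$1"
  expected_commit="$2"
  # Fail with status 2 if REPO_DIR is not a git repository or has no commits.
  actual_commit="$(git -C "$repo_dir" rev-parse HEAD)" || return 2
  if [ "$actual_commit" != "$expected_commit" ]; then
    echo "Expected $expected_commit but HEAD is $actual_commit" >&2
    return 1
  fi
}
```

From inside `open_clip/`, running `verify_commit . cf86ee7ec4658845f640858ecd34d0f15588271a` would check against the commit pinned above.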

Training

We provide scripts for training on DataCompDR-12M and DataCompDR-1B.

cd open_clip/
bash configs/run_datacomp12m.sh  # Train a ViT-B/16 on DataComp-12M without DR
bash configs/run_datacompdr12m.sh  # Train a ViT-B/16 on DataComp-12M with DR
bash configs/run_datacompdr1B.sh  # Train a ViT-B/16 on DataComp-1B with DR