|
8 | 8 | - `cd /somewhere/with/lots/of/space`
|
9 | 9 | 4. Download the index parts from the Hugging Face repository
|
10 | 10 | - `mkdir index-parts && cd index-parts`
|
11 |
| - - `for i in {00..79}; do aria2c -x 16 https://huggingface.co/datasets/laion/laion5b-h14-index/resolve/main/index-parts/$i.index -o $i.parquet; done` |
| 11 | + - `for i in {00..79}; do aria2c -x 16 https://huggingface.co/datasets/laion/laion5b-h14-index/resolve/main/index-parts/$i.index -o $i.index; done` |
12 | 12 | - `cd ..`
|
13 | 13 | 5. Combine the index parts using the following command
|
14 | 14 | - `clip-retrieval index_combiner --input_folder "index-parts" --output_folder "combined-indices"`
|
15 | 15 | 6. Now download the metadata parts from the following metadata repos
|
16 | 16 |
|
17 | 17 | - ***multi embeddings***
|
18 | 18 | - `mkdir multi-embeddings && cd multi-embeddings`
|
19 |
| - - `for i in {0000..2268}; do aria2c -x 16 https://huggingface.co/datasets/laion/laion2b-multi-vit-h-14-embeddings/resolve/main/metadata/metadata_$i.parquet -o $i.parquet; done` |
| 19 | + - `for i in {0000..2268}; do aria2c -x 16 https://huggingface.co/datasets/laion/laion2b-multi-vit-h-14-embeddings/resolve/main/metadata/metadata_$i.parquet -o metadata_$i.parquet; done` |
20 | 20 | - `cd ..`
|
21 | 21 | - ***english embeddings***
|
22 | 22 | - `mkdir en-embeddings && cd en-embeddings`
|
23 |
| - - `for i in {0000..2313}; do aria2c -x 16 https://huggingface.co/datasets/laion/laion2b-en-vit-h-14-embeddings/resolve/main/metadata/metadata_$i.parquet -o $i.parquet; done` |
| 23 | + - `for i in {0000..2313}; do aria2c -x 16 https://huggingface.co/datasets/laion/laion2b-en-vit-h-14-embeddings/resolve/main/metadata/metadata_$i.parquet -o metadata_$i.parquet; done` |
24 | 24 | - `cd ..`
|
25 | 25 | - ***nolang embeddings***
|
26 | 26 | - `mkdir nolang-embeddings && cd nolang-embeddings`
|
27 |
| - - `for i in {0000..1273}; do aria2c -x 16 https://huggingface.co/datasets/laion/laion1b-nolang-vit-h-14-embeddings/resolve/main/metadata/metadata_$i.parquet -o $i.parquet; done` |
| 27 | + - `for i in {0000..1273}; do aria2c -x 16 https://huggingface.co/datasets/laion/laion1b-nolang-vit-h-14-embeddings/resolve/main/metadata/metadata_$i.parquet -o metadata_$i.parquet; done` |
28 | 28 | - `cd ..`
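
The download loops above can be interrupted or skip a part, so it may help to confirm that every expected file is present and non-empty before moving on. The following is a minimal sketch, not part of `clip-retrieval`: `missing_parts` is a hypothetical helper name, and it assumes the file names and index ranges used in the loops above.

```shell
# Hypothetical helper (not part of clip-retrieval): print the name of every
# expected part that is missing or empty in a folder.
# Usage: missing_parts <folder> <prefix> <suffix> <last_index> <zero_pad_width>
missing_parts() (
  folder=$1 prefix=$2 suffix=$3 last=$4 width=$5
  i=0
  while [ "$i" -le "$last" ]; do
    # Build the zero-padded file name, e.g. metadata_0042.parquet
    name=$(printf "%s%0${width}d%s" "$prefix" "$i" "$suffix")
    # -s is true only if the file exists and has a size greater than zero
    [ -s "$folder/$name" ] || echo "$name"
    i=$((i + 1))
  done
)
```

For example, `missing_parts index-parts "" .index 79 2` would list any absent index parts, and `missing_parts multi-embeddings metadata_ .parquet 2268 4` would do the same for the multi metadata; the other two folders follow the same pattern with last indices 2313 and 1273. No output means every part was found.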
|
29 | 29 |
|
30 |
| -7. Now run the metadata combiner for each of the metadata folders |
| 30 | +7. Now run the metadata combiner for each of the metadata folders (Warning: ensure all metadata parquet files are present before combining them, or the combined arrow file may be misaligned with the index) |
31 | 31 |
|
32 | 32 | - ***multi embeddings***
|
33 | 33 | - `clip-retrieval parquet_to_arrow --parquet_folder="multi-embeddings" --output_arrow_folder="multi-combined" --columns_to_return='["url", "caption"]'`
|
|
50 | 50 | ```
|
51 | 51 | {
|
52 | 52 | "laion5B-H-14": {
|
53 |
| - "indice_folder": "laion5B_H14", |
| 53 | + "indice_folder": "Laion5B_H14", |
54 | 54 | "provide_safety_model": true,
|
55 | 55 | "enable_faiss_memory_mapping": true,
|
56 | 56 | "use_arrow": true,
|
|