Commit 746e907

Update laion5B_h14 guide (rom1504#307)
- Ensured aria2 download commands use the correct filename and extension
- Added warning to check for correct file count when combining metadata (if there are any missing files, it may ruin search results)
- Capitalized "Laion5B_H14" folder name in the indices.json (makes it distinct from the index name itself, and aligns with the rest of the guide which assumes the folder name is capitalized)
1 parent 0b623d8 commit 746e907

1 file changed, +6 -6 lines changed

Diff for: docs/laion5B_h14_back.md

@@ -8,26 +8,26 @@
 - `cd /somehwere/with/lots/of/space`
 4. Download the index parts from the hugging-face repository
 - `mkdir index-parts && cd index-parts`
-- `for i in {00..79}; do aria2c -x 16 https://huggingface.co/datasets/laion/laion5b-h14-index/resolve/main/index-parts/$i.index -o $i.parquet; done`
+- `for i in {00..79}; do aria2c -x 16 https://huggingface.co/datasets/laion/laion5b-h14-index/resolve/main/index-parts/$i.index -o $i.index; done`
 - `cd ..`
 5. Combine the index parts using the following command
 - `clip-retrieval index_combiner --input_folder "index-parts" --output_folder "combined-indices"`
 6. Now download the metadata parts from the following metadata repos
 
 - ***multi embeddings***
 - `mkdir multi-embeddings && cd multi-embeddings`
-- `for i in {0000..2268}; do aria2c -x 16 https://huggingface.co/datasets/laion/laion2b-multi-vit-h-14-embeddings/resolve/main/metadata/metadata_$i.parquet -o $i.parquet; done`
+- `for i in {0000..2268}; do aria2c -x 16 https://huggingface.co/datasets/laion/laion2b-multi-vit-h-14-embeddings/resolve/main/metadata/metadata_$i.parquet -o metadata_$i.parquet; done`
 - `cd ..`
 - ***english embeddings***
 - `mkdir en-embeddings && cd en-embeddings`
-- `for i in {0000..2313}; do aria2c -x 16 https://huggingface.co/datasets/laion/laion2b-en-vit-h-14-embeddings/resolve/main/metadata/metadata_$i.parquet -o $i.parquet; done`
+- `for i in {0000..2313}; do aria2c -x 16 https://huggingface.co/datasets/laion/laion2b-en-vit-h-14-embeddings/resolve/main/metadata/metadata_$i.parquet -o metadata_$i.parquet; done`
 - `cd ..`
 - ***nolang embeddings***
 - `mkdir nolang-embeddings && nolang en-embeddings`
-- `for i in {0000..1273}; do aria2c -x 16 https://huggingface.co/datasets/laion/laion1b-nolang-vit-h-14-embeddings/resolve/main/metadata/metadata_$i.parquet -o $i.parquet; done`
+- `for i in {0000..1273}; do aria2c -x 16 https://huggingface.co/datasets/laion/laion1b-nolang-vit-h-14-embeddings/resolve/main/metadata/metadata_$i.parquet -o metadata_$i.parquet; done`
 - `cd ..`
 
-7. Now run the metadata combiner for each of the metadata folders
+7. Now run the metadata combiner for each of the metadata folders (Warning: ensure all metadata parquet files are present before combining them, or the combined arrow file may be misaligned with the index)
 
 - ***multi embeddings***
 - `clip-retrieval parquet_to_arrow --parquet_folder="multi-embeddings" --output_arrow_folder="multi-combined" --columns_to_return='["url", "caption"]'`
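The warning added to step 7 asks the reader to confirm that every metadata parquet file is present before combining. One way to check this mechanically is sketched below. It is not part of the guide: `missing_parts` is a hypothetical helper, and it assumes the files were saved as `metadata_NNNN.parquet` (the naming the updated download commands produce) with four-digit, zero-padded indices starting at 0000.

```shell
#!/bin/sh
# Hypothetical helper (not part of clip-retrieval): print the name of every
# expected metadata_NNNN.parquet file that is absent from a folder.
# Usage: missing_parts <folder> <last_index>
# Note: plain sh has no local variables, so these names are global.
missing_parts() {
    folder=$1
    last=$2
    i=0
    while [ "$i" -le "$last" ]; do
        # Build the zero-padded file name, e.g. metadata_0042.parquet
        name=$(printf 'metadata_%04d.parquet' "$i")
        # Report the name only when the file does not exist
        [ -f "$folder/$name" ] || echo "$name"
        i=$((i + 1))
    done
}
```

Called from the directory that holds the download folders, `missing_parts multi-embeddings 2268`, `missing_parts en-embeddings 2313`, and `missing_parts nolang-embeddings 1273` should each print nothing when the download is complete. The last indices are taken from the loop ranges in step 6. Any name that is printed can be downloaded again with the corresponding `aria2c` command before running the combiner.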
@@ -50,7 +50,7 @@
 ```
 {
 "laion5B-H-14": {
-"indice_folder": "laion5B_H14",
+"indice_folder": "Laion5B_H14",
 "provide_safety_model": true,
 "enable_faiss_memory_mapping": true,
 "use_arrow": true,
