Commit 3082878

refactor clip inference (rom1504#107)
* refactor clip inference

divides in
* reader: reads the files or wds into tensors of images and text
* mapper: transform tensors into embeddings and metadata
* write: write the embeddings and metadata
* runner combine reader, mapper and writer
* distributor: run runner using various distribution strategies
* main: use all of that to provide the whole feature

distribution is based on output partitions

* add logger module
* add tool to build pex
* fix logger
* make ci better
* Remove pytest coverage
1 parent 2e351a9 commit 3082878
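
The split described in the commit message (reader, mapper, writer, runner, distributor, main) can be pictured with a small sketch. This is not the code from the commit: every class name below is hypothetical, the "embedding" is just the length of a string standing in for a CLIP model, and the writer keeps results in memory instead of writing files. It only illustrates how work might be keyed by output partition.

```python
from typing import Dict, Iterator, List


class ListReader:
    """Reader: yields batches of raw samples belonging to one output partition."""

    def __init__(self, samples: List[str], partition_count: int, batch_size: int):
        self.samples = samples
        self.partition_count = partition_count
        self.batch_size = batch_size

    def __call__(self, partition_id: int) -> Iterator[List[str]]:
        # Round-robin sharding (an assumption): partition i gets samples i, i + n, i + 2n, ...
        shard = self.samples[partition_id :: self.partition_count]
        for start in range(0, len(shard), self.batch_size):
            yield shard[start : start + self.batch_size]


class LengthMapper:
    """Mapper: stand-in for the model; the 'embedding' is just the text length."""

    def __call__(self, batch: List[str]) -> Dict[str, list]:
        return {"embeddings": [[float(len(s))] for s in batch], "metadata": list(batch)}


class MemoryWriter:
    """Writer: collects embeddings and metadata per partition, in memory."""

    def __init__(self) -> None:
        self.partitions: Dict[int, Dict[str, list]] = {}

    def __call__(self, partition_id: int, mapped: Dict[str, list]) -> None:
        out = self.partitions.setdefault(partition_id, {"embeddings": [], "metadata": []})
        out["embeddings"].extend(mapped["embeddings"])
        out["metadata"].extend(mapped["metadata"])


class Runner:
    """Runner: combines reader, mapper and writer for a single partition."""

    def __init__(self, reader: ListReader, mapper: LengthMapper, writer: MemoryWriter):
        self.reader, self.mapper, self.writer = reader, mapper, writer

    def __call__(self, partition_id: int) -> None:
        for batch in self.reader(partition_id):
            self.writer(partition_id, self.mapper(batch))


class SequentialDistributor:
    """Distributor: runs the runner on every partition, one after the other."""

    def __init__(self, runner: Runner, partition_count: int):
        self.runner, self.partition_count = runner, partition_count

    def __call__(self) -> None:
        for partition_id in range(self.partition_count):
            self.runner(partition_id)


def main(samples: List[str], partition_count: int = 2, batch_size: int = 2) -> Dict[int, Dict[str, list]]:
    """Main: wires the pieces together and returns what the writer collected."""
    writer = MemoryWriter()
    runner = Runner(ListReader(samples, partition_count, batch_size), LengthMapper(), writer)
    SequentialDistributor(runner, partition_count)()
    return writer.partitions
```

Because the runner only needs a partition id, a different distributor (for example one that hands partition ids to worker processes) could replace `SequentialDistributor` without touching the other pieces.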

43 files changed: +2043 -418 lines changed

Diff for: .github/workflows/ci.yml

+28 -4
@@ -9,8 +9,25 @@ on:
       - main
 
 jobs:
-  build:
-
+  lint:
+    runs-on: ubuntu-latest
+    steps:
+    - uses: actions/checkout@v2
+    - name: Set up Python 3.8
+      uses: actions/setup-python@v2
+      with:
+        python-version: 3.8
+    - name: Install
+      run: |
+        python3 -m venv .env
+        source .env/bin/activate
+        python -m pip install -U pip
+        make install-dev
+    - name: Lint
+      run: |
+        source .env/bin/activate
+        make lint
+  tests:
     runs-on: ubuntu-latest
     strategy:
       matrix:
@@ -22,6 +39,13 @@ jobs:
       uses: actions/setup-python@v2
       with:
         python-version: ${{ matrix.python-version }}
-    - name: Install, lint and unit tests
+    - name: Install
+      run: |
+        python3 -m venv .env
+        source .env/bin/activate
+        make install
+        make install-dev
+    - name: Unit tests
       run: |
-        make venv-lint-test
+        source .env/bin/activate
+        make test

Diff for: .github/workflows/python-publish.yml

+22 -17
@@ -1,35 +1,40 @@
-# This workflows will upload a Python Package using Twine when a release is created
-# For more information see: https://help.github.com/en/actions/language-and-framework-guides/using-python-with-github-actions#publishing-to-package-registries
-
-name: Upload Python Package
+name: Release
 
 on:
-  release:
-    types: [created]
-  workflow_dispatch:
-
+  push:
+    branches:
+      - main
 jobs:
   deploy:
-
     runs-on: ubuntu-latest
-
     steps:
     - uses: actions/checkout@v2
-    - name: Use Node.js 14.x
-      uses: actions/setup-node@v1
+    - uses: actions-ecosystem/action-regex-match@v2
+      id: regex-match
       with:
-        node-version: 14.x
-    - run: cd front && npm install
-    - run: cd front && npm run build
+        text: ${{ github.event.head_commit.message }}
+        regex: '^Release ([^ ]+)'
     - name: Set up Python
       uses: actions/setup-python@v2
       with:
-        python-version: '3.x'
+        python-version: '3.8'
     - name: Install dependencies
       run: |
         python -m pip install --upgrade pip
-        pip install setuptools wheel twine
+        pip install setuptools wheel twine pex
+    - name: Build pex
+      run: |
+        make build-pex
+    - name: Release
+      if: ${{ steps.regex-match.outputs.match != '' }}
+      uses: softprops/action-gh-release@v1
+      with:
+        files: |
+          clip_retrieval_torch.tgz
+          clip_retrieval.tgz
+        tag_name: ${{ steps.regex-match.outputs.group1 }}
     - name: Build and publish
+      if: ${{ steps.regex-match.outputs.match != '' }}
       env:
         TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
         TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
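
The release workflow above gates its release and publish steps on the pattern `'^Release ([^ ]+)'` applied to the head commit message, and uses the first capture group as the tag name. The sketch below reproduces that check with Python's `re` module as a stand-in for the action's regex engine (an assumption), and the helper name `release_tag` is hypothetical.

```python
import re
from typing import Optional

# Same pattern as in the workflow: the message must start with "Release ",
# and the tag is everything up to the next space.
RELEASE_PATTERN = re.compile(r"^Release ([^ ]+)")


def release_tag(commit_message: str) -> Optional[str]:
    """Return the tag captured from a release commit message, or None if it is not a release."""
    match = RELEASE_PATTERN.search(commit_message)
    return match.group(1) if match else None


print(release_tag("Release 2.0.0"))      # a release commit
print(release_tag("fix logger"))         # an ordinary commit
```

In Python, `[^ ]` also matches newlines, so with a multi-line commit message the captured group would run past the first line until the first space. Keeping release commit messages to a single line such as `Release 2.0.0` sidesteps that.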

Diff for: .gitignore

+6 -2
@@ -12,7 +12,11 @@ cat
 embedding_folder
 index_folder
 indices_paths.json
-.coverage
+.coverage*
 test_folder
 build
-dist
+dist
+wandb
+.pexing
+*.tgz
+*.pex

Diff for: Makefile

+10 -2
@@ -13,14 +13,22 @@ lint: ## [Local development] Run mypy, pylint and black
 black: ## [Local development] Auto-format python code using black
 	python -m black -l 120 .
 
+build-pex:
+	python3 -m venv .pexing
+	. .pexing/bin/activate && python -m pip install -U pip && python -m pip install pex
+	. .pexing/bin/activate && python -m pex --layout packed -f https://download.pytorch.org/whl/cu113/torch_stable.html setuptools s3fs==2021.11.0 pyspark==3.2.0 torch==1.10.2+cu113 torchvision==0.11.3+cu113 . -o clip_retrieval.pex -v
+	rm -rf .pexing
+	tar czf clip_retrieval_torch.tgz clip_retrieval.pex/.deps/torch-1.10.2+cu113-cp38-cp38-linux_x86_64.whl
+	tar czf clip_retrieval.tgz --exclude clip_retrieval.pex/.deps/torch-1.10.2+cu113-cp38-cp38-linux_x86_64.whl clip_retrieval.pex
+
 venv-lint-test: ## [Continuous integration]
 	python3 -m venv .env && . .env/bin/activate && make install install-dev lint test && rm -rf .env
 
 test: ## [Local development] Run unit tests
 	rm -rf tests/test_folder/
-	python -m pytest -v --cov=clip_retrieval --cov-report term-missing --cov-fail-under 0.0 tests
+	python -m pytest -x -s -v tests
 
 .PHONY: help
 
 help: # Run `make help` to get help on the make commands
-	@grep -E '^[0-9a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | sort | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-30s\033[0m %s\n", $$1, $$2}'
+	@grep -E '^[0-9a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | sort | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-30s\033[0m %s\n", $$1, $$2}'
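
The two `tar` commands in `build-pex` split the packed pex into one archive holding only the torch wheel and another holding everything else. A rough Python equivalent using the standard `tarfile` module is sketched below. The function name `split_archives` and its parameters are hypothetical, and the sketch is only meant to show the include/exclude logic, not to replace the Makefile target.

```python
import tarfile
from pathlib import Path
from typing import Optional


def split_archives(root: Path, heavy_rel: str, heavy_tgz: Path, rest_tgz: Path) -> None:
    """Archive `root/heavy_rel` on its own, and the rest of `root` without it.

    `heavy_rel` is a POSIX-style path relative to `root`.
    """
    heavy_arc = f"{root.name}/{heavy_rel}"

    # First archive: only the heavy entry, stored under the same relative name.
    with tarfile.open(heavy_tgz, "w:gz") as tar:
        tar.add(root / heavy_rel, arcname=heavy_arc)

    def drop_heavy(info: tarfile.TarInfo) -> Optional[tarfile.TarInfo]:
        # Returning None from a tarfile filter excludes that member.
        is_heavy = info.name == heavy_arc or info.name.startswith(heavy_arc + "/")
        return None if is_heavy else info

    # Second archive: the whole tree minus the heavy entry.
    with tarfile.open(rest_tgz, "w:gz") as tar:
        tar.add(root, arcname=root.name, filter=drop_heavy)
```

Shipping the large wheel separately presumably keeps each release asset smaller, though the commit does not state the motivation.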

Diff for: README.md

+7 -1
@@ -83,13 +83,17 @@ clip_inference turn a set of text+image into clip embeddings
 * **enable_image** Enable image processing (default *True*)
 * **enable_metadata** Enable metadata processing (default *False*)
 * **write_batch_size** Write batch size (default *10**6*)
-* **subset_size** Only process a subset of this size (default *None*)
 * **wds_image_key** Key to use for images in webdataset. (default *jpg*)
 * **wds_caption_key** Key to use for captions in webdataset. (default *txt*)
 * **clip_model** CLIP model to load (default *ViT-B/32*)
 * **mclip_model** MCLIP model to load (default *sentence-transformers/clip-ViT-B-32-multilingual-v1*)
 * **use_mclip** If False it performs the inference using CLIP; MCLIP otherwise (default *False*)
 * **use_jit** uses jit for the clip model (default *True*)
+* **distribution_strategy** choose how to distribute the job, see distribution section for details (default *sequential*)
+* **wds_number_file_per_input_file** estimation of the number of sample per tar if using wds and not specifying output_partition_count (default *10000*)
+* **output_partition_count** number of output partitions (default *None*)
+* **wandb_project** wandb project to use (default *clip_retrieval*)
+* **enable_wandb** whether to use wandb (default *False*)
 
 
 ### Loading/writing files on hdfs
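
The README hunk above says that `wds_number_file_per_input_file` is used as an estimate of samples per tar when `output_partition_count` is not given. One way such an estimate could turn into a partition count is sketched below. The formula (estimated samples divided by `write_batch_size`, rounded up, at least 1) is an assumption for illustration and is not confirmed by this diff; the function name and the `input_file_count` parameter are hypothetical. Only the three option names and their defaults come from the README.

```python
import math
from typing import Optional


def estimate_output_partition_count(
    input_file_count: int,
    write_batch_size: int = 10**6,
    wds_number_file_per_input_file: int = 10000,
    output_partition_count: Optional[int] = None,
) -> int:
    """Return the explicit partition count if given, otherwise an estimate."""
    if output_partition_count is not None:
        return output_partition_count
    # Assumed rule: one output partition per `write_batch_size` estimated samples.
    estimated_samples = input_file_count * wds_number_file_per_input_file
    return max(1, math.ceil(estimated_samples / write_batch_size))


# 250 tars * 10000 samples = 2.5 million samples, which needs 3 partitions of 10**6
print(estimate_output_partition_count(250))
```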
@@ -281,6 +285,8 @@ make test
 
 You can use `make black` to reformat the code
 
+`python -m pytest -x -s -v tests -k "test_runner"` to run a specific test
+
 If you want to use the front through the python backend or frontend, run
 ```
 cd front

Diff for: clip_retrieval/__init__.py

+3 -1
@@ -3,6 +3,8 @@
 from .clip_back import clip_back
 from .clip_filter import clip_filter
 from .clip_index import clip_index
-from .clip_inference import clip_inference
+from .clip_inference.main import main as clip_inference
+
+# from .clip_inference import clip_inference
 from .clip_end2end import clip_end2end
 from .clip_front import clip_front

Diff for: clip_retrieval/cli.py

+4
@@ -21,3 +21,7 @@ def main():
             "front": clip_front,
         }
     )
+
+
+if __name__ == "__main__":
+    main()
