docs/getting_started/environment.md (+9 -6)
@@ -4,7 +4,7 @@ We encourage users to do everything inside a Docker container spawned with our p
 ## Zeus Docker image
 
-We provide a pre-built Docker image in Docker Hub: https://hub.docker.com/r/symbioticlab/zeus.
+We provide a pre-built Docker image in [Docker Hub](https://hub.docker.com/r/symbioticlab/zeus){.external}.
 On top of the `nvidia/cuda:11.3.1-devel-ubuntu20.04` image, the following are provided:
 
 1. CMake 3.22.0
@@ -39,15 +39,18 @@ docker run -it \
 2. `SYS_ADMIN` capability is needed to manage the power configurations of the GPU via NVML.
 3. PyTorch DataLoader workers need enough shared memory for IPC. If the PyTorch training process dies with a Bus error, consider increasing this even more.
 
-If you would like your changes to `zeus/` outside the container to be immediately applied inside the container, mount the repository into the container:
+Use the `-v` option to mount outside data into the container.
+For instance, if you would like your changes to `zeus/` outside the container to be immediately applied inside the container, mount the repository into the container.
+You can also mount training data into the container.
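For illustration, a complete `docker run` invocation with both mounts looks like the one the ImageNet README below spells out (`$DATA_DIR` is a placeholder for wherever the dataset lives on the host):

```sh
# Working directory is the repository root.
docker run -it \
    --gpus all \
    --cap-add SYS_ADMIN \
    --shm-size 64G \
    -v $(pwd):/workspace/zeus `# Mount the Zeus repository` \
    -v $DATA_DIR:/data/imagenet `# Mount the training data` \
    symbioticlab/zeus:latest \
    bash
```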
 - [Integrating Zeus with CIFAR100 dataset](https://github.com/SymbioticLab/Zeus/tree/master/examples/cifar100){.external}
 - [Integrating Zeus with ImageNet dataset](https://github.com/SymbioticLab/Zeus/tree/master/examples/imagenet){.external}
 - [Integrating Zeus with NLP](https://github.com/SymbioticLab/Zeus/tree/master/examples/capriccio){.external}
 
 ## Recurring jobs
 
-The optimal batch size is explored *across* multiple job runs using a Multi-Armed Bandit algorithm.
+!!! Info
+    We plan to integrate [`ZeusMaster`][zeus.run.ZeusMaster] with an MLOps platform like [KubeFlow](https://www.kubeflow.org/).
+    Let us know about your preferences, use cases, and expectations by [posting an issue](https://github.com/SymbioticLab/Zeus/issues/new?assignees=&labels=&template=feature_request.md&title=Regarding%20Integration%20with%20MLOps%20Platroms)!
+
+The cost-optimal batch size is located *across* multiple job runs using a Multi-Armed Bandit algorithm.
 First, go through the steps for non-recurring jobs.
 [`ZeusDataLoader`][zeus.run.ZeusDataLoader] will transparently optimize the GPU power limit for any given batch size.
 Then, you can use [`ZeusMaster`][zeus.run.ZeusMaster] to drive recurring jobs and batch size optimization.
 
 This example will come in handy:
 
 - [Running trace-driven simulation on single recurring jobs and the Alibaba GPU cluster trace](https://github.com/SymbioticLab/Zeus/tree/master/examples/trace_driven){.external}
 
-!!! Info
-    We plan to integrate [`ZeusMaster`][zeus.run.ZeusMaster] with an MLOps platform like [KubeFlow](https://www.kubeflow.org/).
-    Feel free to let us know about your preferences, use cases, and expectations.
examples/capriccio/README.md (+4 -4)
@@ -21,7 +21,7 @@ While our paper is about optimizing the batch size and power limit over multiple
 ### Dependencies
 
 1. Generate Capriccio, following the instructions in [Capriccio's README.md](../../capriccio/).
-1. Install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
+1. If you're not using our [Docker image](https://ml.energy/zeus/getting_started/environment/), install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
 1. Install python dependencies for this example:
    ```sh
    pip install -r requirements.txt
@@ -60,7 +60,7 @@ This example shows how to integrate [`ZeusDataLoader`](https://ml.energy/zeus/re
 ### Dependencies
 
 1. Generate Capriccio, following the instructions in [Capriccio's README.md](../../capriccio/).
-1. Install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
+1. If you're not using our [Docker image](https://ml.energy/zeus/getting_started/environment/), install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
 1. Install python dependencies for this example:
    ```sh
    pip install -r requirements.txt
@@ -92,7 +92,7 @@ You can use Zeus's [`ProfileDataLoader`](https://ml.energy/zeus/reference/profil
 ### Dependencies
 
 1. Generate Capriccio, following the instructions in [Capriccio's README.md](../../capriccio/).
-1. Install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
+1. If you're not using our [Docker image](https://ml.energy/zeus/getting_started/environment/), install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
 1. Install python dependencies for this example:
    ```sh
    pip install -r requirements.txt
@@ -129,7 +129,7 @@ At the same time, a CSV file with headers epoch number, split (train or eval), a
 ### Dependencies
 
 1. Generate Capriccio, following the instructions in [Capriccio's README.md](../../capriccio/).
-1. Only for those not using our Docker image, install PyTorch separately:
+1. Only for those not using our [Docker image](https://ml.energy/zeus/getting_started/environment/), install PyTorch separately:
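The command that follows this item falls outside the hunk; a minimal sketch assuming a plain pip install (the version pin is a guess, not taken from the PR):

```sh
# Hypothetical pin; pick a wheel that matches your CUDA setup (see pytorch.org).
pip install torch==1.10.1
```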
examples/imagenet/README.md (+15 -36)
@@ -22,27 +22,15 @@ While our paper is about optimizing the batch size and power limit over multiple
 ### Dependencies
 
+All packages are pre-installed if you're using our [Docker image](https://ml.energy/zeus/getting_started/environment/).
+You just need to download the ImageNet data and mount it to the Docker container with the `-v` option.
+
 1. Download the ILSVRC2012 dataset from [the ImageNet homepage](http://www.image-net.org/).
    Then, extract archives using [this script](https://github.com/pytorch/examples/blob/main/imagenet/extract_ILSVRC.sh) provided by PyTorch.
-2. Install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
-   - When spawning the container, mount the dataset to the container by specifying `-v $DATA_DIR:/data/imagenet`. The complete command will be:
-     ```sh
-     # Working directory is repository root
-     docker run -it \
-         --gpus all `# Specify the number of GPUs to use. When 'all' is set, all the GPUs will be used.` \
-         --cap-add SYS_ADMIN \
-         --shm-size 64G \
-         -v $(pwd):/workspace/zeus \
-         -v $DATA_DIR:/data/imagenet `# Mount the dataset to the container` \
-         symbioticlab/zeus:latest \
-         bash
-     ```
-   - Zeus will always use **ALL** the GPUs available to it. If you want to use specific GPUs on your node, please use the `CUDA_VISIBLE_DEVICES` environment variable, or use our Docker image and replace the argument following `--gpus` in the above `docker run` command with your preference. For example:
-     - Mount 2 GPUs to the Docker container: `--gpus 2`.
-     - Mount specific GPUs to the Docker container: `--gpus '"device=0,1"'`.
-3. Install python dependencies for this example:
+1. Install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
+1. Install `torchvision`:
    ```sh
-   pip install -r requirements.txt
+   conda install -c pytorch torchvision==0.11.2
    ```
 
 ### Example command
@@ -85,25 +73,13 @@ This example shows how to integrate [`ZeusDataLoader`](https://ml.energy/zeus/re
 ### Dependencies
 
+All packages are pre-installed if you're using our [Docker image](https://ml.energy/zeus/getting_started/environment/).
+You just need to download the ImageNet data and mount it to the Docker container with the `-v` option.
+
 1. Download the ILSVRC2012 dataset from [the ImageNet homepage](http://www.image-net.org/).
    Then, extract archives using [this script](https://github.com/pytorch/examples/blob/main/imagenet/extract_ILSVRC.sh) provided by PyTorch.
-2. Install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
-   - When spawning the container, mount the dataset to the container by specifying `-v $DATA_DIR:/data/imagenet`. The complete command will be:
-     ```sh
-     # Working directory is repository root
-     docker run -it \
-         --gpus all `# Specify the number of GPUs to use. When 'all' is set, all the GPUs will be used.` \
-         --cap-add SYS_ADMIN \
-         --shm-size 64G \
-         -v $(pwd):/workspace/zeus \
-         -v $DATA_DIR:/data/imagenet `# Mount the dataset to the container` \
-         symbioticlab/zeus:latest \
-         bash
-     ```
-   - Zeus will always use **ALL** the GPUs available to it. If you want to use specific GPUs on your node, please use the `CUDA_VISIBLE_DEVICES` environment variable, or use our Docker image and replace the argument following `--gpus` in the above `docker run` command with your preference. For example:
-     - Mount 2 GPUs to the Docker container: `--gpus 2`.
-     - Mount specific GPUs to the Docker container: `--gpus '"device=0,1"'`.
-3. Only for those not using our Docker image, install `torchvision` separately:
+1. Install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
+1. Install `torchvision`:
    ```sh
    conda install -c pytorch torchvision==0.11.2
    ```
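For reference, the GPU-selection advice in the removed blocks above boils down to the following sketch (the training script name is a placeholder, not a file from the PR):

```sh
# Zeus uses ALL the GPUs visible to it. Outside Docker, restrict it by
# setting CUDA_VISIBLE_DEVICES before launching your training script
# (train.py here is a placeholder):
CUDA_VISIBLE_DEVICES=0,1 python train.py

# Inside Docker, select GPUs with the --gpus flag instead:
docker run -it --gpus '"device=0,1"' --cap-add SYS_ADMIN --shm-size 64G \
    symbioticlab/zeus:latest bash   # or `--gpus 2` for any two GPUs
```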
@@ -133,9 +109,12 @@ python run_zeus.py \
 ### Dependencies
 
+All packages are pre-installed if you're using our [Docker image](https://ml.energy/zeus/getting_started/environment/).
+You just need to download the ImageNet data and mount it to the Docker container with the `-v` option.
+
 1. Download the ILSVRC2012 dataset from [the ImageNet homepage](http://www.image-net.org/).
    Then, extract archives using [this script](https://github.com/pytorch/examples/blob/main/imagenet/extract_ILSVRC.sh) provided by PyTorch.
-2. Only for those not using our Docker image, install PyTorch, `torchvision`, and `cudatoolkit` separately:
+2. Install PyTorch, `torchvision`, and `cudatoolkit`:
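The install command itself also falls outside the hunk; a plausible conda sketch, where the `pytorch==1.10.1` and `cudatoolkit=11.3` pins are assumptions chosen to match `torchvision==0.11.2` and the CUDA 11.3.1 base image rather than taken from the PR:

```sh
# Hypothetical pins -- check the example's README for the exact command.
conda install -c pytorch pytorch==1.10.1 torchvision==0.11.2 cudatoolkit=11.3
```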