
Commit a207905

jaywonchung and Rosie-m committed Oct 8, 2022
Chore: Clean up Docs and READMEs
Co-authored-by: Luoxi Meng <[email protected]>
1 parent ec3f746 commit a207905

15 files changed: +149 −124 lines changed

‎Dockerfile

+3

@@ -50,3 +50,6 @@ ADD . /workspace/zeus
 
 # When an outside zeus directory is mounted, have it apply immediately.
 RUN pip install -e zeus
+
+# Build and bake in the Zeus monitor.
+RUN cd /workspace/zeus/zeus_monitor && cmake . && make && cp zeus_monitor /usr/local/bin/ && cd /workspace

‎docs/getting_started/environment.md

+9 −6

@@ -4,7 +4,7 @@ We encourage users to do everything inside a Docker container spawned with our p
 
 ## Zeus Docker image
 
-We provide a pre-built Docker image in Docker Hub: https://hub.docker.com/r/symbioticlab/zeus.
+We provide a pre-built Docker image in [Docker Hub](https://hub.docker.com/r/symbioticlab/zeus){.external}.
 On top of the `nvidia/cuda:11.3.1-devel-ubuntu20.04` image, the following are provided:
 
 1. CMake 3.22.0
@@ -39,15 +39,18 @@ docker run -it \
 2. `SYS_ADMIN` capability is needed to manage the power configurations of the GPU via NVML.
 3. PyTorch DataLoader workers need enough shared memory for IPC. If the PyTorch training process dies with a Bus error, consider increasing this even more.
 
-If you would like your changes to `zeus/` outside the container to be immediately applied inside the container, mount the repository into the container:
+Use the `-v` option to mount outside data into the container.
+For instance, if you would like your changes to `zeus/` outside the container to be immediately applied inside the container, mount the repository into the container.
+You can also mount training data into the container.
 
 ``` { .sh .annotate }
 # Working directory is repository root
 docker run -it \
-    --gpus all \                  # (1)!
-    --cap-add SYS_ADMIN \         # (2)!
-    --shm-size 64G \              # (3)!
-    -v $(pwd):/workspace/zeus \   # (4)!
+    --gpus all \                    # (1)!
+    --cap-add SYS_ADMIN \           # (2)!
+    --shm-size 64G \                # (3)!
+    -v $(pwd):/workspace/zeus \     # (4)!
+    -v /data/imagenet:/data/imagenet:ro \
     symbioticlab/zeus:latest \
     bash
 ```
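A note on the `SYS_ADMIN` line above: that capability is what lets NVML change GPU power limits from inside the container. A minimal sketch of checking it, assuming the `pynvml` bindings are available (an assumption — they are not part of this commit):

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
# Re-applying the current limit changes nothing, but it only succeeds
# when the process is allowed to manage power settings (e.g., SYS_ADMIN).
limit_mw = pynvml.nvmlDeviceGetPowerManagementLimit(handle)
try:
    pynvml.nvmlDeviceSetPowerManagementLimit(handle, limit_mw)
    print("Power limit management works inside this container.")
except pynvml.NVMLError as err:
    print(f"Cannot manage the power limit: {err}")
pynvml.nvmlShutdown()
```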

‎docs/getting_started/index.md

+34 −23

@@ -40,20 +40,28 @@ for epoch_number in train_loader.epochs():
 ```
 
 ### Data parallel with multi-GPU on a single-node
-Zeus supports only one process per GPU profiling. In data parallel training,
-each process has its `local_rank` within the node and will run the
-following code.
-We also specify the important steps for a better comprehension.
-Please refer to [the integration example with ImageNet](https://github.com/SymbioticLab/Zeus/tree/master/examples/imagenet/train.py)
-for the complete example.
+
+!!! Important
+    Zeus assumes that exactly one process manages one GPU, and hence
+    one instance of [`ZeusDataLoader`][zeus.run.ZeusDataLoader] exists
+    for each GPU.
+
+Users can integrate Zeus into existing data parallel training scripts
+with five specific steps, which are noted below in the comments.
+
+Please refer to
+[our integration example with ImageNet](https://github.com/SymbioticLab/Zeus/tree/master/examples/imagenet/train.py)
+for a complete example.
 
 ```python
 import torch
+import torch.distributed as dist
 import torchvision
 
 from zeus.run import ZeusDataLoader
 
 # Step 1: Initialize the default process group.
+# This should be done before instantiating `ZeusDataLoader`.
 dist.init_process_group(
     backend=args.dist_backend,
     init_method=args.dist_url,
@@ -63,9 +71,9 @@ dist.init_process_group(
 model = torchvision.models.resnet18()
 torch.cuda.set_device(local_rank)
 model.cuda(local_rank)
-# Zeus only supports one process per GPU profiling. If you are doing data
+# Zeus assumes that exactly one process manages one GPU. If you are doing data
 # parallel training, please use `DistributedDataParallel` for model replication
-# and specify the `device_ids` and `output_device` correctly.
+# and specify the `device_ids` and `output_device` as below:
 model = torch.nn.parallel.DistributedDataParallel(
     model,
     device_ids=[local_rank],
@@ -77,47 +85,50 @@ model = torch.nn.parallel.DistributedDataParallel(
 train_sampler = torch.utils.data.distributed.DistributedSampler(train_set)
 eval_sampler = torch.utils.data.distributed.DistributedSampler(eval_set)
 
-# Step 4: Create instances of `ZeusDataLoader`.
-# Pass "dp" to `distributed` and samplers in the previous step to
-# `sampler`.
+# Step 4: Instantiate `ZeusDataLoader`.
+# `distributed="dp"` tells `ZeusDataLoader` to operate in data parallel mode.
 # The one instantiated with `max_epochs` becomes the train dataloader.
 train_loader = ZeusDataLoader(train_set, batch_size=256, max_epochs=100,
                               sampler=train_sampler, distributed="dp")
 eval_loader = ZeusDataLoader(eval_set, batch_size=256, sampler=eval_sampler,
                              distributed="dp")
 
-# Step 5: Put your training code here.
+# Step 5: Training loop.
+# Use the train dataloader's `epochs` generator to allow Zeus to early-stop
+# based on the training cost. Use `report_metric` to let Zeus know the current
+# validation metric.
 for epoch_number in train_loader.epochs():
     for batch in train_loader:
         # Learn from batch
     for batch in eval_loader:
        # Evaluate on batch
 
-    # If doing data parallel training, please make sure to call
-    # `torch.distributed.all_reduce()` to reduce the validation metric
-    # across all GPUs before calling `train_loader.report_metric()`.
-    train_loader.report_metric(validation_metric)
+    # Make sure you all-reduce the validation metric across all GPUs,
+    # since Zeus expects the final validation metric.
+    val_metric_tensor = torch.tensor([validation_metric], device="cuda")
+    dist.all_reduce(val_metric_tensor, async_op=False)
+    train_loader.report_metric(val_metric_tensor.item())
 ```
 
 The following examples will help:
 
 - Integrating Zeus with computer vision
-  - [Integrating Zeus with CIFAR100 dataset](https://github.com/SymbioticLab/Zeus/tree/master/examples/cifar100){.external}
-  - [Integrating Zeus with ImageNet dataset](https://github.com/SymbioticLab/Zeus/tree/master/examples/imagenet){.external}
+    - [Integrating Zeus with CIFAR100 dataset](https://github.com/SymbioticLab/Zeus/tree/master/examples/cifar100){.external}
    - [Integrating Zeus with ImageNet dataset](https://github.com/SymbioticLab/Zeus/tree/master/examples/imagenet){.external}
 - [Integrating Zeus with NLP](https://github.com/SymbioticLab/Zeus/tree/master/examples/capriccio){.external}
 
 
 ## Recurring jobs
 
-The optimal batch size is explored *across* multiple job runs using a Multi-Armed Bandit algorithm.
+!!! Info
+    We plan to integrate [`ZeusMaster`][zeus.run.ZeusMaster] with an MLOps platform like [KubeFlow](https://www.kubeflow.org/).
+    Let us know about your preferences, use cases, and expectations by [posting an issue](https://github.com/SymbioticLab/Zeus/issues/new?assignees=&labels=&template=feature_request.md&title=Regarding%20Integration%20with%20MLOps%20Platroms)!
+
+The cost-optimal batch size is located *across* multiple job runs using a Multi-Armed Bandit algorithm.
 First, go through the steps for non-recurring jobs.
 [`ZeusDataLoader`][zeus.run.ZeusDataLoader] will transparently optimize the GPU power limit for any given batch size.
 Then, you can use [`ZeusMaster`][zeus.run.ZeusMaster] to drive recurring jobs and batch size optimization.
 
 This example will come in handy:
 
 - [Running trace-driven simulation on single recurring jobs and the Alibaba GPU cluster trace](https://github.com/SymbioticLab/Zeus/tree/master/examples/trace_driven){.external}
-
-!!! Info
-    We plan to integrate [`ZeusMaster`][zeus.run.ZeusMaster] with an MLOps platform like [KubeFlow](https://www.kubeflow.org/).
-    Feel free to let us know about your preferences, use cases, and expectations.
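One note on Step 5 above: `dist.all_reduce` sums the per-GPU tensors by default, so when the validation metric is itself a mean (validation accuracy, for instance), dividing by the world size before reporting gives the global mean. A minimal sketch under that assumption, reusing `validation_metric` and `train_loader` from the snippet above:

```python
import torch
import torch.distributed as dist

# Assumed: each rank computed `validation_metric` over its own data shard.
val_metric_tensor = torch.tensor([validation_metric], device="cuda")
dist.all_reduce(val_metric_tensor, op=dist.ReduceOp.SUM, async_op=False)
val_metric_tensor /= dist.get_world_size()  # sum across GPUs -> mean
train_loader.report_metric(val_metric_tensor.item())
```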

‎docs/getting_started/installing_and_building.md

+13 −5

@@ -5,7 +5,7 @@ This document explains how to install the [`zeus`][zeus] Python package and how
 
 !!! Tip
     We encourage users to utilize our Docker image. Please refer to [Environment setup](./environment.md). Quick command:
     ```bash
-    docker run -it --gpus 1 --cap-add SYS_ADMIN --shm-size 64G symbioticlab/zeus:latest bash
+    docker run -it --gpus all --cap-add SYS_ADMIN --shm-size 64G symbioticlab/zeus:latest bash
     ```
 
 
@@ -24,17 +24,25 @@ conda install -c pytorch pytorch==1.10.1 cudatoolkit==11.3.1
 
 ### Install `zeus`
 
-The standard command is:
+To install the latest release of `zeus`:
 
 ```bash
-# Working directory is repository root
+pip install zeus-ml
+```
+
+If you would like to follow the `HEAD`:
+
+```bash
+git clone https://github.com/SymbioticLab/Zeus.git zeus
+cd zeus
 pip install .
 ```
 
-For those would like to make changes to the source code and run them, we suggest an editable install:
+For those would like to make changes to the source code and run them, we suggest an editable installation:
 
 ```bash
-# Working directory is repository root
+git clone https://github.com/SymbioticLab/Zeus.git zeus
+cd zeus
 pip install -e .
 ```
 

‎docs/index.md

+4 −4

@@ -22,10 +22,10 @@ Zeus is part of [The ML.ENERGY Initiative](https://ml.energy){.external}.
 Refer to [Getting Started](getting_started/index.md) for instructions on environment setup, installation, and integration.
 We also provide integration examples:
 
-- Integrating Zeus with computer vision
-    - [Integrating Zeus with CIFAR100 dataset](https://github.com/SymbioticLab/Zeus/tree/master/examples/cifar100){.external}
-    - [Integrating Zeus with ImageNet dataset](https://github.com/SymbioticLab/Zeus/tree/master/examples/imagenet){.external}
-- [Integrating Zeus with NLP](https://github.com/SymbioticLab/Zeus/tree/master/examples/capriccio){.external}
+- Integrating Zeus with Computer Vision
+    - [ImageNet](https://github.com/SymbioticLab/Zeus/tree/master/examples/imagenet){.external}
+    - [CIFAR100](https://github.com/SymbioticLab/Zeus/tree/master/examples/cifar100){.external}
+- [Integrating Zeus with Natural Language Processing and Huggingface](https://github.com/SymbioticLab/Zeus/tree/master/examples/capriccio){.external}
 - [Running trace-driven simulation on single recurring jobs and the Alibaba GPU cluster trace](https://github.com/SymbioticLab/Zeus/tree/master/examples/trace_driven){.external}
 
 ## Extending Zeus

‎examples/capriccio/README.md

+4 −4

@@ -21,7 +21,7 @@ While our paper is about optimizing the batch size and power limit over multiple
 ### Dependencies
 
 1. Generate Capriccio, following the instructions in [Capriccio's README.md](../../capriccio/).
-1. Install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
+1. If you're not using our [Docker image](https://ml.energy/zeus/getting_started/environment/), install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
 1. Install python dependencies for this example:
    ```sh
    pip install -r requirements.txt
@@ -60,7 +60,7 @@ This example shows how to integrate [`ZeusDataLoader`](https://ml.energy/zeus/re
 ### Dependencies
 
 1. Generate Capriccio, following the instructions in [Capriccio's README.md](../../capriccio/).
-1. Install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
+1. If you're not using our [Docker image](https://ml.energy/zeus/getting_started/environment/), install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
 1. Install python dependencies for this example:
    ```sh
    pip install -r requirements.txt
@@ -92,7 +92,7 @@ You can use Zeus's [`ProfileDataLoader`](https://ml.energy/zeus/reference/profil
 ### Dependencies
 
 1. Generate Capriccio, following the instructions in [Capriccio's README.md](../../capriccio/).
-1. Install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
+1. If you're not using our [Docker image](https://ml.energy/zeus/getting_started/environment/), install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
 1. Install python dependencies for this example:
    ```sh
    pip install -r requirements.txt
@@ -129,7 +129,7 @@ At the same time, a CSV file with headers epoch number, split (train or eval), a
 ### Dependencies
 
 1. Generate Capriccio, following the instructions in [Capriccio's README.md](../../capriccio/).
-1. Only for those not using our Docker image, install PyTorch separately:
+1. Only for those not using our [Docker image](https://ml.energy/zeus/getting_started/environment/), install PyTorch separately:
   ```sh
   conda install -c pytorch pytorch==1.10.1
    ```

‎examples/cifar100/README.md

+16 −8

@@ -22,10 +22,12 @@ While our paper is about optimizing the batch size and power limit over multiple
 
 ### Dependencies
 
+All packages are pre-installed if you're using our [Docker image](https://ml.energy/zeus/getting_started/environment/).
+
 1. Install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
-1. Install python dependencies for this example:
+1. Install `torchvision`:
   ```sh
-   pip install -r requirements.txt
+   conda install -c pytorch torchvision==0.11.2
   ```
 
 ### Example command
@@ -59,8 +61,10 @@ This example shows how to integrate [`ZeusDataLoader`](https://ml.energy/zeus/re
 
 ### Dependencies
 
+All packages are pre-installed if you're using our [Docker image](https://ml.energy/zeus/getting_started/environment/).
+
 1. Install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
-1. Only for those not using our Docker image, install `torchvision` separately:
+1. Install `torchvision`:
   ```sh
   conda install -c pytorch torchvision==0.11.2
   ```
@@ -88,8 +92,10 @@ You can use Zeus's [`ProfileDataLoader`](https://ml.energy/zeus/reference/profil
 
 ### Dependencies
 
+All packages are pre-installed if you're using our [Docker image](https://ml.energy/zeus/getting_started/environment/).
+
 1. Install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
-1. Only for those not using our Docker image, install `torchvision` separately:
+1. Install `torchvision`:
   ```sh
   conda install -c pytorch torchvision==0.11.2
   ```
@@ -123,10 +129,12 @@ At the same time, a CSV file with headers epoch number, split (`train` or `eval`
 
 ### Dependencies
 
-Only for those not using our Docker image, install PyTorch, `torchvision`, and `cudatoolkit` separately:
-```sh
-conda install -c pytorch pytorch==1.10.1 torchvision==0.11.2 cudatoolkit==11.3.1
-```
+All packages are pre-installed if you're using our [Docker image](https://ml.energy/zeus/getting_started/environment/).
+
+1. Install `torchvision`:
+   ```sh
+   conda install -c pytorch torchvision==0.11.2
+   ```
 
 ### Example command
 

‎examples/imagenet/README.md

+15 −36

@@ -22,27 +22,15 @@ While our paper is about optimizing the batch size and power limit over multiple
 
 ### Dependencies
 
+All packages are pre-installed if you're using our [Docker image](https://ml.energy/zeus/getting_started/environment/).
+You just need to download the ImageNet data and mount it to the Docker container with the `-v` option.
+
 1. Download the ILSVRC2012 dataset from [the ImageNet homepage](http://www.image-net.org/).
    Then, extract archives using [this script](https://github.com/pytorch/examples/blob/main/imagenet/extract_ILSVRC.sh) provided by PyTorch.
-2. Install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
-   - When spawning the container, mount the dataset to the container by specifying `-v $DATA_DIR:/data/imagenet`. The complete command will be:
-     ```sh
-     # Working directory is repository root
-     docker run -it \
-         --gpus all `# Specify the number of GPUs to use. When 'all' is set, all the GPUs will be used.` \
-         --cap-add SYS_ADMIN \
-         --shm-size 64G \
-         -v $(pwd):/workspace/zeus \
-         -v $DATA_DIR:/data/imagenet `# Mount the dataset to the container` \
-         symbioticlab/zeus:latest \
-         bash
-     ```
-   - Zeus will always use **ALL** the GPUs available to it. If you want to use specific GPUs on your node, please use the `CUDA_VISIBLE_DEVICES` environment variable, or use our Docker image and replace the argument following `--gpus` in the above `docker run` command with your preference. For example:
-     - Mount 2 GPUs to the Docker container: `--gpus 2`.
-     - Mount specific GPUs to the Docker container: `--gpus '"device=0,1"'`.
-3. Install python dependencies for this example:
+1. Install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
+1. Install `torchvision`:
   ```sh
-   pip install -r requirements.txt
+   conda install -c pytorch torchvision==0.11.2
   ```
 
 ### Example command
@@ -85,25 +73,13 @@ This example shows how to integrate [`ZeusDataLoader`](https://ml.energy/zeus/re
 
 ### Dependencies
 
+All packages are pre-installed if you're using our [Docker image](https://ml.energy/zeus/getting_started/environment/).
+You just need to download the ImageNet data and mount it to the Docker container with the `-v` option.
+
 1. Download the ILSVRC2012 dataset from [the ImageNet homepage](http://www.image-net.org/).
    Then, extract archives using [this script](https://github.com/pytorch/examples/blob/main/imagenet/extract_ILSVRC.sh) provided by PyTorch.
-2. Install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
-   - When spawning the container, mount the dataset to the container by specifying `-v $DATA_DIR:/data/imagenet`. The complete command will be:
-     ```sh
-     # Working directory is repository root
-     docker run -it \
-         --gpus all `# Specify the number of GPUs to use. When 'all' is set, all the GPUs will be used.` \
-         --cap-add SYS_ADMIN \
-         --shm-size 64G \
-         -v $(pwd):/workspace/zeus \
-         -v $DATA_DIR:/data/imagenet `# Mount the dataset to the container` \
-         symbioticlab/zeus:latest \
-         bash
-     ```
-   - Zeus will always use **ALL** the GPUs available to it. If you want to use specific GPUs on your node, please use the `CUDA_VISIBLE_DEVICES` environment variable, or use our Docker image and replace the argument following `--gpus` in the above `docker run` command with your preference. For example:
-     - Mount 2 GPUs to the Docker container: `--gpus 2`.
-     - Mount specific GPUs to the Docker container: `--gpus '"device=0,1"'`.
-3. Only for those not using our Docker image, install `torchvision` separately:
+1. Install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
+1. Install `torchvision`:
   ```sh
   conda install -c pytorch torchvision==0.11.2
   ```
@@ -133,9 +109,12 @@ python run_zeus.py \
 
 ### Dependencies
 
+All packages are pre-installed if you're using our [Docker image](https://ml.energy/zeus/getting_started/environment/).
+You just need to download the ImageNet data and mount it to the Docker container with the `-v` option.
+
 1. Download the ILSVRC2012 dataset from [the ImageNet homepage](http://www.image-net.org/).
   Then, extract archives using [this script](https://github.com/pytorch/examples/blob/main/imagenet/extract_ILSVRC.sh) provided by PyTorch.
-2. Only for those not using our Docker image, install PyTorch, `torchvision`, and `cudatoolkit` separately:
+2. Install PyTorch, `torchvision`, and `cudatoolkit`:
   ```sh
   conda install -c pytorch pytorch==1.10.1 torchvision==0.11.2 cudatoolkit==11.3.1
   ```

‎mkdocs.yml

+2 −2

@@ -101,8 +101,8 @@ nav:
   - overview/index.md
 - Getting Started:
   - getting_started/index.md
-  - Environment setup: getting_started/environment.md
-  - Installing and building: getting_started/installing_and_building.md
+  - Environment Setup: getting_started/environment.md
+  - Installing and Building: getting_started/installing_and_building.md
 - Extending Zeus: extend.md
 - Source Code Reference: reference/
 

‎zeus/analyze.py

+2 −1

@@ -98,7 +98,8 @@ def avg_power(
         end: End time of the window to consider.
 
     Raises:
-        ValueError: From `sklearn.metrics.auc`. May mean that the profile window is too small.
+        ValueError: From `sklearn.metrics.auc`, when the duration of the
+            profiling window is too small.
     """
     df = cast(pd.DataFrame, pd.read_csv(logfile, engine="python", skipfooter=1))
     df["Time"] = pd.to_datetime(df["Time"])

‎zeus/run/dataloader.py

+34 −23

@@ -80,21 +80,27 @@ class ZeusDataLoader(DataLoader):
 
     ## Data parallel with multi-GPU on a single-node
 
-    Zeus supports only one process per GPU profiling. In data parallel training,
-    each process has its `local_rank` within the node and will run the
-    following code. Please refer to [the integration example with ImageNet]
-    (https://github.com/SymbioticLab/Zeus/tree/master/examples/imagenet/train.py)
-    for the complete example.
+    !!! Important
+        Zeus assumes that exactly one process manages one GPU, and hence
+        one instance of [`ZeusDataLoader`][zeus.run.ZeusDataLoader] exists
+        for each GPU.
+
+    Users can integrate Zeus into existing data parallel training scripts
+    with five specific steps, which are noted below in the comments.
+
+    Please refer to
+    [our integration example with ImageNet](https://github.com/SymbioticLab/Zeus/tree/master/examples/imagenet/train.py)
+    for a complete example.
 
     ```python
     import torch
+    import torch.distributed as dist
     import torchvision
 
     from zeus.run import ZeusDataLoader
 
     # Step 1: Initialize the default process group.
-    # Make sure to call `init_process_group` before calling the constructor of
-    # `ZeusDataLoader`.
+    # This should be done before instantiating `ZeusDataLoader`.
     dist.init_process_group(
         backend=args.dist_backend,
         init_method=args.dist_url,
@@ -104,40 +110,43 @@ class ZeusDataLoader(DataLoader):
     model = torchvision.models.resnet18()
     torch.cuda.set_device(local_rank)
     model.cuda(local_rank)
-    # Zeus only supports one process per GPU profiling. If you are doing data
+    # Zeus assumes that exactly one process manages one GPU. If you are doing data
     # parallel training, please use `DistributedDataParallel` for model replication
-    # and specify the `device_ids` and `output_device` correctly.
+    # and specify the `device_ids` and `output_device` as below:
     model = torch.nn.parallel.DistributedDataParallel(
         model,
         device_ids=[local_rank],
         output_device=local_rank,
     )
 
     # Step 3: Create instances of `DistributedSampler` to partition the dataset
-    # across the GPUs.
+    # across the GPUs.
     train_sampler = torch.utils.data.distributed.DistributedSampler(train_set)
     eval_sampler = torch.utils.data.distributed.DistributedSampler(eval_set)
 
-    # Step 4: Create instances of `ZeusDataLoader`.
-    # Pass `"dp"` to `distributed` and samplers in the previous step to
-    # `sampler`.
+    # Step 4: Instantiate `ZeusDataLoader`.
+    # `distributed="dp"` tells `ZeusDataLoader` to operate in data parallel mode.
     # The one instantiated with `max_epochs` becomes the train dataloader.
     train_loader = ZeusDataLoader(train_set, batch_size=256, max_epochs=100,
-                              sampler=train_sampler, distributed="dp")
+                                  sampler=train_sampler, distributed="dp")
     eval_loader = ZeusDataLoader(eval_set, batch_size=256, sampler=eval_sampler,
-                             distributed="dp")
+                                 distributed="dp")
 
-    # Step 5: Put your training code here.
+    # Step 5: Training loop.
+    # Use the train dataloader's `epochs` generator to allow Zeus to early-stop
+    # based on the cost. Use `report_metric` to let Zeus know the current
+    # validation metric.
     for epoch_number in train_loader.epochs():
         for batch in train_loader:
             # Learn from batch
         for batch in eval_loader:
             # Evaluate on batch
 
-        # If doing data parallel training, please make sure to call
-        # `torch.distributed.all_reduce()` to reduce the validation metric
-        # across all GPUs before calling `train_loader.report_metric()`.
-        train_loader.report_metric(validation_metric)
+        # Make sure you all-reduce the validation metric across all GPUs,
+        # since Zeus expects the final validation metric.
+        val_metric_tensor = torch.tensor([validation_metric], device="cuda")
+        dist.all_reduce(val_metric_tensor, async_op=False)
+        train_loader.report_metric(val_metric_tensor.item())
     ```
 
     # Environment variables
@@ -474,11 +483,13 @@ def epochs(self) -> Generator[int, None, None]:
             Epoch indices starting from zero.
 
         Raises:
-            ZeusCostThresholdExceededException: the predicated cost after next epoch exceeds the cost threshold.
-                When doing data parallel training, this exception is used for ternimating ALL the processes.
+            ZeusCostThresholdExceededException: the predicted cost after the next
+                epoch exceeds the cost threshold. When doing data parallel training,
+                this exception is used for ternimating all the processes.
         """
         # Sanity check.
-        assert self._is_train, "Use epochs() on the train dataloader."
+        if not self._is_train:
+            raise RuntimeError("Use epochs() on the train dataloader.")
 
         while True:
             # Variables for storing time/energy consumption & cost

‎zeus/run/master.py

+3 −3

@@ -126,7 +126,7 @@ def build_logdir(
         job: Job to run.
         num_recurrence: The total number of recurrences.
         eta_knob: $\eta$ used in the cost metric.
-            $cost = \eta * ETA + (1 - \eta) * MaxPower * TTA$
+            $\textrm{cost} = \eta \cdot \textrm{ETA} + (1 - \eta) \cdot \textrm{MaxPower} \cdot \textrm{TTA}$
         beta_knob: `beta_knob * min_cost` is the early stopping cost threshold.
             Set to `np.inf` to disable early stopping.
         exist_ok: Passed to `os.makedirs`. If `False`, will err if the directory
@@ -163,7 +163,7 @@ def run_job(
         rec_i: Recurrence number of this run of the job.
         tries: Retry number of this recurrence of the job.
         eta_knob: $\eta$ used in the cost metric.
-            $cost = \eta * ETA + (1 - \eta) * MaxPower * TTA$
+            $\textrm{cost} = \eta \cdot \textrm{ETA} + (1 - \eta) \cdot \textrm{MaxPower} \cdot \textrm{TTA}$
         cost_ub: Cost upper bound. The job is terminated when the next epoch is going
             to exceed the cost upper bound.
 
@@ -261,7 +261,7 @@ def run(
         beta_knob: `beta_knob * min_eta` is the early stopping cost threshold.
             Set to `np.inf` to disable early stopping.
         eta_knob: $\eta$ used in the cost metric.
-            $cost = \eta * ETA + (1 - \eta) * MaxPower * TTA$
+            $\textrm{cost} = \eta \cdot \textrm{ETA} + (1 - \eta) \cdot \textrm{MaxPower} \cdot \textrm{TTA}$
 
     Returns:
         A list of [`HistoryEntry`][zeus.analyze.HistoryEntry] objects for each job run.
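The reformatted formula is identical in all three docstrings. As a plain-Python sketch of what it computes (the function and parameter names are illustrative, not the library's API; ETA and TTA denote the job's energy- and time-to-accuracy, and MaxPower the maximum GPU power limit):

```python
def zeus_cost(eta_knob: float, energy_to_accuracy: float,
              time_to_accuracy: float, max_power: float) -> float:
    """cost = eta_knob * ETA + (1 - eta_knob) * MaxPower * TTA, both terms in Joules."""
    return eta_knob * energy_to_accuracy + (1 - eta_knob) * max_power * time_to_accuracy

# eta_knob = 1 weighs energy only; eta_knob = 0 weighs (power-scaled) time only.
cost = zeus_cost(0.5, energy_to_accuracy=1.2e6, time_to_accuracy=3.6e3, max_power=300.0)
```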

‎zeus/simulate.py

+3 −3

@@ -97,7 +97,7 @@ def simulate_one_job(
         beta_knob: `beta_knob * min_eta` is the early stopping cost threshold.
             Set to `np.inf` to disable early stopping.
         eta_knob: $\eta$ used in the hybrid cost metric.
-            $cost = \eta * ETA + (1 - \eta) * MaxPower * TTA$
+            $\textrm{cost} = \eta \cdot \textrm{ETA} + (1 - \eta) \cdot \textrm{MaxPower} \cdot \textrm{TTA}$
 
     Returns:
         A list of [`HistoryEntry`][zeus.analyze.HistoryEntry] objects for each job run.
@@ -244,7 +244,7 @@ def simulate_one_alibaba_group(
         beta_knob: `beta_knob * min_eta` is the early stopping cost threshold.
             Set to `np.inf` to disable early stopping.
         eta_knob: $\eta$ used in the hybrid cost metric.
-            $cost = \eta * ETA + (1 - \eta) * MaxPower * TTA$
+            $\textrm{cost} = \eta \cdot \textrm{ETA} + (1 - \eta) \cdot \textrm{MaxPower} \cdot \textrm{TTA}$
 
     Returns:
         A list of [`HistoryEntry`][zeus.analyze.HistoryEntry] objects for each job run.
@@ -681,7 +681,7 @@ def _run_job(
         cost_ub: Cost upper bound. The job is terminated when the next epoch is going
             to exceed the cost upper bound.
         eta_knob: $\eta$ used in the hybrid cost metric.
-            $cost = \eta * ETA + (1 - \eta) * MaxPower * TTA$
+            $\textrm{cost} = \eta \cdot \textrm{ETA} + (1 - \eta) \cdot \textrm{MaxPower} \cdot \textrm{TTA}$
         profile_power: Whether this run of the job should profile power during the
             first epoch.
 

‎zeus/util/metric.py

+5 −5

@@ -48,11 +48,11 @@ class ZeusCostThresholdExceededException(Exception):
     are still alive.
 
     Attributes:
-        time_consumed: Time consumed till the current epoch.
-        energy_consumed: Energy consumed till the current epoch.
-        cost: Computed Zeus's energy-time cost metric till the current epoch.
-        next_cost: Predicted Zeus's energy-time cost metric after next epoch.
-        cost_thresh: The cost threshold.
+        time_consumed (float): Time consumed until the current epoch.
+        energy_consumed (float): Energy consumed until the current epoch.
+        cost (float): Computed Zeus's energy-time cost metric until the current epoch.
+        next_cost (float): Predicted Zeus's energy-time cost metric after next epoch.
+        cost_thresh (float): The cost threshold.
     """
 
     def __init__(

‎zeus_monitor/README.md

+2 −1

@@ -14,9 +14,10 @@ Usage: ./zeus_monitor LOGFILE DURATION SLEEP_MS [GPU_IDX]
 
 ## Building
 
+The Zeus monitor is pre-built for you if you're using our [Docker image](https://ml.energy/zeus/getting_started/environment/).
+
 ### Dependencies
 
-All dependencies are pre-installed if you're using our Docker image.
 1. CMake >= 3.22
 1. CUDAToolkit, especially NVML (`libnvidia-ml.so`)
 
