
Commit a207905

jaywonchung and Rosie-m committed Oct 8, 2022
Chore: Clean up Docs and READMEs
Co-authored-by: Luoxi Meng <[email protected]>
1 parent ec3f746 commit a207905

15 files changed: +149 −124 lines changed

‎Dockerfile

+3

@@ -50,3 +50,6 @@ ADD . /workspace/zeus
 
 # When an outside zeus directory is mounted, have it apply immediately.
 RUN pip install -e zeus
+
+# Build and bake in the Zeus monitor.
+RUN cd /workspace/zeus/zeus_monitor && cmake . && make && cp zeus_monitor /usr/local/bin/ && cd /workspace

‎docs/getting_started/environment.md

+9 −6

@@ -4,7 +4,7 @@ We encourage users to do everything inside a Docker container spawned with our p
 
 ## Zeus Docker image
 
-We provide a pre-built Docker image in Docker Hub: https://hub.docker.com/r/symbioticlab/zeus.
+We provide a pre-built Docker image in [Docker Hub](https://hub.docker.com/r/symbioticlab/zeus){.external}.
 On top of the `nvidia/cuda:11.3.1-devel-ubuntu20.04` image, the following are provided:
 
 1. CMake 3.22.0
@@ -39,15 +39,18 @@ docker run -it \
 2. `SYS_ADMIN` capability is needed to manage the power configurations of the GPU via NVML.
 3. PyTorch DataLoader workers need enough shared memory for IPC. If the PyTorch training process dies with a Bus error, consider increasing this even more.
 
-If you would like your changes to `zeus/` outside the container to be immediately applied inside the container, mount the repository into the container:
+Use the `-v` option to mount outside data into the container.
+For instance, if you would like your changes to `zeus/` outside the container to be immediately applied inside the container, mount the repository into the container.
+You can also mount training data into the container.
 
 ``` { .sh .annotate }
 # Working directory is repository root
 docker run -it \
-    --gpus all \                  # (1)!
-    --cap-add SYS_ADMIN \         # (2)!
-    --shm-size 64G \              # (3)!
-    -v $(pwd):/workspace/zeus \   # (4)!
+    --gpus all \                    # (1)!
+    --cap-add SYS_ADMIN \           # (2)!
+    --shm-size 64G \                # (3)!
+    -v $(pwd):/workspace/zeus \     # (4)!
+    -v /data/imagenet:/data/imagenet:ro \
     symbioticlab/zeus:latest \
     bash
 ```
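A note on the `SYS_ADMIN` line above: that capability is what lets NVML change GPU power limits from inside the container. A minimal sketch of checking it, assuming the `pynvml` bindings are available (an assumption — they are not part of this commit):

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
# Re-applying the current limit changes nothing, but it only succeeds
# when the process is allowed to manage power settings (e.g., SYS_ADMIN).
limit_mw = pynvml.nvmlDeviceGetPowerManagementLimit(handle)
try:
    pynvml.nvmlDeviceSetPowerManagementLimit(handle, limit_mw)
    print("Power limit management works inside this container.")
except pynvml.NVMLError as err:
    print(f"Cannot manage the power limit: {err}")
pynvml.nvmlShutdown()
```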

‎docs/getting_started/index.md

+34 −23

@@ -40,20 +40,28 @@ for epoch_number in train_loader.epochs():
 ```
 
 ### Data parallel with multi-GPU on a single-node
-Zeus supports only one process per GPU profiling. In data parallel training,
-each process has its `local_rank` within the node and will run the
-following code.
-We also specify the important steps for a better comprehension.
-Please refer to [the integration example with ImageNet](https://github.com/SymbioticLab/Zeus/tree/master/examples/imagenet/train.py)
-for the complete example.
+
+!!! Important
+    Zeus assumes that exactly one process manages one GPU, and hence
+    one instance of [`ZeusDataLoader`][zeus.run.ZeusDataLoader] exists
+    for each GPU.
+
+Users can integrate Zeus into existing data parallel training scripts
+with five specific steps, which are noted below in the comments.
+
+Please refer to
+[our integration example with ImageNet](https://github.com/SymbioticLab/Zeus/tree/master/examples/imagenet/train.py)
+for a complete example.
 
 ```python
 import torch
+import torch.distributed as dist
 import torchvision
 
 from zeus.run import ZeusDataLoader
 
 # Step 1: Initialize the default process group.
+# This should be done before instantiating `ZeusDataLoader`.
 dist.init_process_group(
     backend=args.dist_backend,
     init_method=args.dist_url,
@@ -63,9 +71,9 @@ dist.init_process_group(
 model = torchvision.models.resnet18()
 torch.cuda.set_device(local_rank)
 model.cuda(local_rank)
-# Zeus only supports one process per GPU profiling. If you are doing data
+# Zeus assumes that exactly one process manages one GPU. If you are doing data
 # parallel training, please use `DistributedDataParallel` for model replication
-# and specify the `device_ids` and `output_device` correctly.
+# and specify the `device_ids` and `output_device` as below:
 model = torch.nn.parallel.DistributedDataParallel(
     model,
     device_ids=[local_rank],
@@ -77,47 +85,50 @@ model = torch.nn.parallel.DistributedDataParallel(
 train_sampler = torch.utils.data.distributed.DistributedSampler(train_set)
 eval_sampler = torch.utils.data.distributed.DistributedSampler(eval_set)
 
-# Step 4: Create instances of `ZeusDataLoader`.
-# Pass "dp" to `distributed` and samplers in the previous step to
-# `sampler`.
+# Step 4: Instantiate `ZeusDataLoader`.
+# `distributed="dp"` tells `ZeusDataLoader` to operate in data parallel mode.
 # The one instantiated with `max_epochs` becomes the train dataloader.
 train_loader = ZeusDataLoader(train_set, batch_size=256, max_epochs=100,
                               sampler=train_sampler, distributed="dp")
 eval_loader = ZeusDataLoader(eval_set, batch_size=256, sampler=eval_sampler,
                              distributed="dp")
 
-# Step 5: Put your training code here.
+# Step 5: Training loop.
+# Use the train dataloader's `epochs` generator to allow Zeus to early-stop
+# based on the training cost. Use `report_metric` to let Zeus know the current
+# validation metric.
 for epoch_number in train_loader.epochs():
     for batch in train_loader:
         # Learn from batch
     for batch in eval_loader:
        # Evaluate on batch
 
-    # If doing data parallel training, please make sure to call
-    # `torch.distributed.all_reduce()` to reduce the validation metric
-    # across all GPUs before calling `train_loader.report_metric()`.
-    train_loader.report_metric(validation_metric)
+    # Make sure you all-reduce the validation metric across all GPUs,
+    # since Zeus expects the final validation metric.
+    val_metric_tensor = torch.tensor([validation_metric], device="cuda")
+    dist.all_reduce(val_metric_tensor, async_op=False)
+    train_loader.report_metric(val_metric_tensor.item())
 ```
 
 The following examples will help:
 
 - Integrating Zeus with computer vision
-  - [Integrating Zeus with CIFAR100 dataset](https://github.com/SymbioticLab/Zeus/tree/master/examples/cifar100){.external}
-  - [Integrating Zeus with ImageNet dataset](https://github.com/SymbioticLab/Zeus/tree/master/examples/imagenet){.external}
+    - [Integrating Zeus with CIFAR100 dataset](https://github.com/SymbioticLab/Zeus/tree/master/examples/cifar100){.external}
    - [Integrating Zeus with ImageNet dataset](https://github.com/SymbioticLab/Zeus/tree/master/examples/imagenet){.external}
 - [Integrating Zeus with NLP](https://github.com/SymbioticLab/Zeus/tree/master/examples/capriccio){.external}
 
 
 ## Recurring jobs
 
-The optimal batch size is explored *across* multiple job runs using a Multi-Armed Bandit algorithm.
+!!! Info
+    We plan to integrate [`ZeusMaster`][zeus.run.ZeusMaster] with an MLOps platform like [KubeFlow](https://www.kubeflow.org/).
+    Let us know about your preferences, use cases, and expectations by [posting an issue](https://github.com/SymbioticLab/Zeus/issues/new?assignees=&labels=&template=feature_request.md&title=Regarding%20Integration%20with%20MLOps%20Platroms)!
+
+The cost-optimal batch size is located *across* multiple job runs using a Multi-Armed Bandit algorithm.
 First, go through the steps for non-recurring jobs.
 [`ZeusDataLoader`][zeus.run.ZeusDataLoader] will transparently optimize the GPU power limit for any given batch size.
 Then, you can use [`ZeusMaster`][zeus.run.ZeusMaster] to drive recurring jobs and batch size optimization.
 
 This example will come in handy:
 
 - [Running trace-driven simulation on single recurring jobs and the Alibaba GPU cluster trace](https://github.com/SymbioticLab/Zeus/tree/master/examples/trace_driven){.external}
-
-!!! Info
-    We plan to integrate [`ZeusMaster`][zeus.run.ZeusMaster] with an MLOps platform like [KubeFlow](https://www.kubeflow.org/).
-    Feel free to let us know about your preferences, use cases, and expectations.
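One note on Step 5 above: `dist.all_reduce` sums the per-GPU tensors by default, so when the validation metric is itself a mean (validation accuracy, for instance), dividing by the world size before reporting gives the global mean. A minimal sketch under that assumption, reusing `validation_metric` and `train_loader` from the snippet above:

```python
import torch
import torch.distributed as dist

# Assumed: each rank computed `validation_metric` over its own data shard.
val_metric_tensor = torch.tensor([validation_metric], device="cuda")
dist.all_reduce(val_metric_tensor, op=dist.ReduceOp.SUM, async_op=False)
val_metric_tensor /= dist.get_world_size()  # sum across GPUs -> mean
train_loader.report_metric(val_metric_tensor.item())
```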

‎docs/getting_started/installing_and_building.md

+13 −5

@@ -5,7 +5,7 @@ This document explains how to install the [`zeus`][zeus] Python package and how
 
 !!! Tip
     We encourage users to utilize our Docker image. Please refer to [Environment setup](./environment.md). Quick command:
     ```bash
-    docker run -it --gpus 1 --cap-add SYS_ADMIN --shm-size 64G symbioticlab/zeus:latest bash
+    docker run -it --gpus all --cap-add SYS_ADMIN --shm-size 64G symbioticlab/zeus:latest bash
     ```
 
 
@@ -24,17 +24,25 @@ conda install -c pytorch pytorch==1.10.1 cudatoolkit==11.3.1
 
 ### Install `zeus`
 
-The standard command is:
+To install the latest release of `zeus`:
 
 ```bash
-# Working directory is repository root
+pip install zeus-ml
+```
+
+If you would like to follow the `HEAD`:
+
+```bash
+git clone https://github.com/SymbioticLab/Zeus.git zeus
+cd zeus
 pip install .
 ```
 
-For those would like to make changes to the source code and run them, we suggest an editable install:
+For those would like to make changes to the source code and run them, we suggest an editable installation:
 
 ```bash
-# Working directory is repository root
+git clone https://github.com/SymbioticLab/Zeus.git zeus
+cd zeus
 pip install -e .
 ```
 

‎docs/index.md

+4 −4

@@ -22,10 +22,10 @@ Zeus is part of [The ML.ENERGY Initiative](https://ml.energy){.external}.
 Refer to [Getting Started](getting_started/index.md) for instructions on environment setup, installation, and integration.
 We also provide integration examples:
 
-- Integrating Zeus with computer vision
-    - [Integrating Zeus with CIFAR100 dataset](https://github.com/SymbioticLab/Zeus/tree/master/examples/cifar100){.external}
-    - [Integrating Zeus with ImageNet dataset](https://github.com/SymbioticLab/Zeus/tree/master/examples/imagenet){.external}
-- [Integrating Zeus with NLP](https://github.com/SymbioticLab/Zeus/tree/master/examples/capriccio){.external}
+- Integrating Zeus with Computer Vision
+    - [ImageNet](https://github.com/SymbioticLab/Zeus/tree/master/examples/imagenet){.external}
+    - [CIFAR100](https://github.com/SymbioticLab/Zeus/tree/master/examples/cifar100){.external}
+- [Integrating Zeus with Natural Language Processing and Huggingface](https://github.com/SymbioticLab/Zeus/tree/master/examples/capriccio){.external}
 - [Running trace-driven simulation on single recurring jobs and the Alibaba GPU cluster trace](https://github.com/SymbioticLab/Zeus/tree/master/examples/trace_driven){.external}
 
 ## Extending Zeus

‎examples/capriccio/README.md

+4 −4

@@ -21,7 +21,7 @@ While our paper is about optimizing the batch size and power limit over multiple
 ### Dependencies
 
 1. Generate Capriccio, following the instructions in [Capriccio's README.md](../../capriccio/).
-1. Install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
+1. If you're not using our [Docker image](https://ml.energy/zeus/getting_started/environment/), install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
 1. Install python dependencies for this example:
    ```sh
    pip install -r requirements.txt
@@ -60,7 +60,7 @@ This example shows how to integrate [`ZeusDataLoader`](https://ml.energy/zeus/re
 ### Dependencies
 
 1. Generate Capriccio, following the instructions in [Capriccio's README.md](../../capriccio/).
-1. Install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
+1. If you're not using our [Docker image](https://ml.energy/zeus/getting_started/environment/), install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
 1. Install python dependencies for this example:
    ```sh
    pip install -r requirements.txt
@@ -92,7 +92,7 @@ You can use Zeus's [`ProfileDataLoader`](https://ml.energy/zeus/reference/profil
 ### Dependencies
 
 1. Generate Capriccio, following the instructions in [Capriccio's README.md](../../capriccio/).
-1. Install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
+1. If you're not using our [Docker image](https://ml.energy/zeus/getting_started/environment/), install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
 1. Install python dependencies for this example:
    ```sh
    pip install -r requirements.txt
@@ -129,7 +129,7 @@ At the same time, a CSV file with headers epoch number, split (train or eval), a
 ### Dependencies
 
 1. Generate Capriccio, following the instructions in [Capriccio's README.md](../../capriccio/).
-1. Only for those not using our Docker image, install PyTorch separately:
+1. Only for those not using our [Docker image](https://ml.energy/zeus/getting_started/environment/), install PyTorch separately:
   ```sh
   conda install -c pytorch pytorch==1.10.1
    ```

‎examples/cifar100/README.md

+16 −8

@@ -22,10 +22,12 @@ While our paper is about optimizing the batch size and power limit over multiple
 
 ### Dependencies
 
+All packages are pre-installed if you're using our [Docker image](https://ml.energy/zeus/getting_started/environment/).
+
 1. Install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
-1. Install python dependencies for this example:
+1. Install `torchvision`:
   ```sh
-   pip install -r requirements.txt
+   conda install -c pytorch torchvision==0.11.2
   ```
 
 ### Example command
@@ -59,8 +61,10 @@ This example shows how to integrate [`ZeusDataLoader`](https://ml.energy/zeus/re
 
 ### Dependencies
 
+All packages are pre-installed if you're using our [Docker image](https://ml.energy/zeus/getting_started/environment/).
+
 1. Install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
-1. Only for those not using our Docker image, install `torchvision` separately:
+1. Install `torchvision`:
   ```sh
   conda install -c pytorch torchvision==0.11.2
   ```
@@ -88,8 +92,10 @@ You can use Zeus's [`ProfileDataLoader`](https://ml.energy/zeus/reference/profil
 
 ### Dependencies
 
+All packages are pre-installed if you're using our [Docker image](https://ml.energy/zeus/getting_started/environment/).
+
 1. Install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
-1. Only for those not using our Docker image, install `torchvision` separately:
+1. Install `torchvision`:
   ```sh
   conda install -c pytorch torchvision==0.11.2
   ```
@@ -123,10 +129,12 @@ At the same time, a CSV file with headers epoch number, split (`train` or `eval`
 
 ### Dependencies
 
-Only for those not using our Docker image, install PyTorch, `torchvision`, and `cudatoolkit` separately:
-```sh
-conda install -c pytorch pytorch==1.10.1 torchvision==0.11.2 cudatoolkit==11.3.1
-```
+All packages are pre-installed if you're using our [Docker image](https://ml.energy/zeus/getting_started/environment/).
+
+1. Install `torchvision`:
+   ```sh
+   conda install -c pytorch torchvision==0.11.2
+   ```
 
 ### Example command
 

‎examples/imagenet/README.md

+15 −36

@@ -22,27 +22,15 @@ While our paper is about optimizing the batch size and power limit over multiple
 
 ### Dependencies
 
+All packages are pre-installed if you're using our [Docker image](https://ml.energy/zeus/getting_started/environment/).
+You just need to download the ImageNet data and mount it to the Docker container with the `-v` option.
+
 1. Download the ILSVRC2012 dataset from [the ImageNet homepage](http://www.image-net.org/).
    Then, extract archives using [this script](https://github.com/pytorch/examples/blob/main/imagenet/extract_ILSVRC.sh) provided by PyTorch.
-2. Install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
-   - When spawning the container, mount the dataset to the container by specifying `-v $DATA_DIR:/data/imagenet`. The complete command will be:
-     ```sh
-     # Working directory is repository root
-     docker run -it \
-         --gpus all `# Specify the number of GPUs to use. When 'all' is set, all the GPUs will be used.` \
-         --cap-add SYS_ADMIN \
-         --shm-size 64G \
-         -v $(pwd):/workspace/zeus \
-         -v $DATA_DIR:/data/imagenet `# Mount the dataset to the container` \
-         symbioticlab/zeus:latest \
-         bash
-     ```
-   - Zeus will always use **ALL** the GPUs available to it. If you want to use specific GPUs on your node, please use the `CUDA_VISIBLE_DEVICES` environment variable, or use our Docker image and replace the argument following `--gpus` in the above `docker run` command with your preference. For example:
-     - Mount 2 GPUs to the Docker container: `--gpus 2`.
-     - Mount specific GPUs to the Docker container: `--gpus '"device=0,1"'`.
-3. Install python dependencies for this example:
+1. Install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
+1. Install `torchvision`:
   ```sh
-   pip install -r requirements.txt
+   conda install -c pytorch torchvision==0.11.2
   ```
 
 ### Example command
@@ -85,25 +73,13 @@ This example shows how to integrate [`ZeusDataLoader`](https://ml.energy/zeus/re
 
 ### Dependencies
 
+All packages are pre-installed if you're using our [Docker image](https://ml.energy/zeus/getting_started/environment/).
+You just need to download the ImageNet data and mount it to the Docker container with the `-v` option.
+
 1. Download the ILSVRC2012 dataset from [the ImageNet homepage](http://www.image-net.org/).
    Then, extract archives using [this script](https://github.com/pytorch/examples/blob/main/imagenet/extract_ILSVRC.sh) provided by PyTorch.
-2. Install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
-   - When spawning the container, mount the dataset to the container by specifying `-v $DATA_DIR:/data/imagenet`. The complete command will be:
-     ```sh
-     # Working directory is repository root
-     docker run -it \
-         --gpus all `# Specify the number of GPUs to use. When 'all' is set, all the GPUs will be used.` \
-         --cap-add SYS_ADMIN \
-         --shm-size 64G \
-         -v $(pwd):/workspace/zeus \
-         -v $DATA_DIR:/data/imagenet `# Mount the dataset to the container` \
-         symbioticlab/zeus:latest \
-         bash
-     ```
-   - Zeus will always use **ALL** the GPUs available to it. If you want to use specific GPUs on your node, please use the `CUDA_VISIBLE_DEVICES` environment variable, or use our Docker image and replace the argument following `--gpus` in the above `docker run` command with your preference. For example:
-     - Mount 2 GPUs to the Docker container: `--gpus 2`.
-     - Mount specific GPUs to the Docker container: `--gpus '"device=0,1"'`.
-3. Only for those not using our Docker image, install `torchvision` separately:
+1. Install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
+1. Install `torchvision`:
   ```sh
   conda install -c pytorch torchvision==0.11.2
   ```
@@ -133,9 +109,12 @@ python run_zeus.py \
 
 ### Dependencies
 
+All packages are pre-installed if you're using our [Docker image](https://ml.energy/zeus/getting_started/environment/).
+You just need to download the ImageNet data and mount it to the Docker container with the `-v` option.
+
 1. Download the ILSVRC2012 dataset from [the ImageNet homepage](http://www.image-net.org/).
   Then, extract archives using [this script](https://github.com/pytorch/examples/blob/main/imagenet/extract_ILSVRC.sh) provided by PyTorch.
-2. Only for those not using our Docker image, install PyTorch, `torchvision`, and `cudatoolkit` separately:
+2. Install PyTorch, `torchvision`, and `cudatoolkit`:
   ```sh
   conda install -c pytorch pytorch==1.10.1 torchvision==0.11.2 cudatoolkit==11.3.1
   ```

‎mkdocs.yml

+2 −2

@@ -101,8 +101,8 @@ nav:
   - overview/index.md
 - Getting Started:
   - getting_started/index.md
-  - Environment setup: getting_started/environment.md
-  - Installing and building: getting_started/installing_and_building.md
+  - Environment Setup: getting_started/environment.md
+  - Installing and Building: getting_started/installing_and_building.md
 - Extending Zeus: extend.md
 - Source Code Reference: reference/
 

‎zeus/analyze.py

+2 −1

@@ -98,7 +98,8 @@ def avg_power(
         end: End time of the window to consider.
 
     Raises:
-        ValueError: From `sklearn.metrics.auc`. May mean that the profile window is too small.
+        ValueError: From `sklearn.metrics.auc`, when the duration of the
+            profiling window is too small.
     """
     df = cast(pd.DataFrame, pd.read_csv(logfile, engine="python", skipfooter=1))
     df["Time"] = pd.to_datetime(df["Time"])

‎zeus/run/dataloader.py

+34 −23

@@ -80,21 +80,27 @@ class ZeusDataLoader(DataLoader):
 
     ## Data parallel with multi-GPU on a single-node
 
-    Zeus supports only one process per GPU profiling. In data parallel training,
-    each process has its `local_rank` within the node and will run the
-    following code. Please refer to [the integration example with ImageNet]
-    (https://github.com/SymbioticLab/Zeus/tree/master/examples/imagenet/train.py)
-    for the complete example.
+    !!! Important
+        Zeus assumes that exactly one process manages one GPU, and hence
+        one instance of [`ZeusDataLoader`][zeus.run.ZeusDataLoader] exists
+        for each GPU.
+
+    Users can integrate Zeus into existing data parallel training scripts
+    with five specific steps, which are noted below in the comments.
+
+    Please refer to
+    [our integration example with ImageNet](https://github.com/SymbioticLab/Zeus/tree/master/examples/imagenet/train.py)
+    for a complete example.
 
     ```python
     import torch
+    import torch.distributed as dist
     import torchvision
 
     from zeus.run import ZeusDataLoader
 
     # Step 1: Initialize the default process group.
-    # Make sure to call `init_process_group` before calling the constructor of
-    # `ZeusDataLoader`.
+    # This should be done before instantiating `ZeusDataLoader`.
     dist.init_process_group(
         backend=args.dist_backend,
         init_method=args.dist_url,
@@ -104,40 +110,43 @@ class ZeusDataLoader(DataLoader):
     model = torchvision.models.resnet18()
     torch.cuda.set_device(local_rank)
     model.cuda(local_rank)
-    # Zeus only supports one process per GPU profiling. If you are doing data
+    # Zeus assumes that exactly one process manages one GPU. If you are doing data
     # parallel training, please use `DistributedDataParallel` for model replication
-    # and specify the `device_ids` and `output_device` correctly.
+    # and specify the `device_ids` and `output_device` as below:
     model = torch.nn.parallel.DistributedDataParallel(
         model,
         device_ids=[local_rank],
         output_device=local_rank,
     )
 
     # Step 3: Create instances of `DistributedSampler` to partition the dataset
-    # across the GPUs.
+    # across the GPUs.
     train_sampler = torch.utils.data.distributed.DistributedSampler(train_set)
     eval_sampler = torch.utils.data.distributed.DistributedSampler(eval_set)
 
-    # Step 4: Create instances of `ZeusDataLoader`.
-    # Pass `"dp"` to `distributed` and samplers in the previous step to
-    # `sampler`.
+    # Step 4: Instantiate `ZeusDataLoader`.
+    # `distributed="dp"` tells `ZeusDataLoader` to operate in data parallel mode.
     # The one instantiated with `max_epochs` becomes the train dataloader.
     train_loader = ZeusDataLoader(train_set, batch_size=256, max_epochs=100,
-                              sampler=train_sampler, distributed="dp")
+                                  sampler=train_sampler, distributed="dp")
     eval_loader = ZeusDataLoader(eval_set, batch_size=256, sampler=eval_sampler,
-                             distributed="dp")
+                                 distributed="dp")
 
-    # Step 5: Put your training code here.
+    # Step 5: Training loop.
+    # Use the train dataloader's `epochs` generator to allow Zeus to early-stop
+    # based on the cost. Use `report_metric` to let Zeus know the current
+    # validation metric.
     for epoch_number in train_loader.epochs():
         for batch in train_loader:
             # Learn from batch
         for batch in eval_loader:
             # Evaluate on batch
 
-        # If doing data parallel training, please make sure to call
-        # `torch.distributed.all_reduce()` to reduce the validation metric
-        # across all GPUs before calling `train_loader.report_metric()`.
-        train_loader.report_metric(validation_metric)
+        # Make sure you all-reduce the validation metric across all GPUs,
+        # since Zeus expects the final validation metric.
+        val_metric_tensor = torch.tensor([validation_metric], device="cuda")
+        dist.all_reduce(val_metric_tensor, async_op=False)
+        train_loader.report_metric(val_metric_tensor.item())
     ```
 
     # Environment variables
@@ -474,11 +483,13 @@ def epochs(self) -> Generator[int, None, None]:
             Epoch indices starting from zero.
 
         Raises:
-            ZeusCostThresholdExceededException: the predicated cost after next epoch exceeds the cost threshold.
-                When doing data parallel training, this exception is used for ternimating ALL the processes.
+            ZeusCostThresholdExceededException: the predicted cost after the next
+                epoch exceeds the cost threshold. When doing data parallel training,
+                this exception is used for ternimating all the processes.
         """
         # Sanity check.
-        assert self._is_train, "Use epochs() on the train dataloader."
+        if not self._is_train:
+            raise RuntimeError("Use epochs() on the train dataloader.")
 
         while True:
             # Variables for storing time/energy consumption & cost

‎zeus/run/master.py

+3 −3

@@ -126,7 +126,7 @@ def build_logdir(
         job: Job to run.
         num_recurrence: The total number of recurrences.
         eta_knob: $\eta$ used in the cost metric.
-            $cost = \eta * ETA + (1 - \eta) * MaxPower * TTA$
+            $\textrm{cost} = \eta \cdot \textrm{ETA} + (1 - \eta) \cdot \textrm{MaxPower} \cdot \textrm{TTA}$
         beta_knob: `beta_knob * min_cost` is the early stopping cost threshold.
             Set to `np.inf` to disable early stopping.
         exist_ok: Passed to `os.makedirs`. If `False`, will err if the directory
@@ -163,7 +163,7 @@ def run_job(
         rec_i: Recurrence number of this run of the job.
         tries: Retry number of this recurrence of the job.
         eta_knob: $\eta$ used in the cost metric.
-            $cost = \eta * ETA + (1 - \eta) * MaxPower * TTA$
+            $\textrm{cost} = \eta \cdot \textrm{ETA} + (1 - \eta) \cdot \textrm{MaxPower} \cdot \textrm{TTA}$
         cost_ub: Cost upper bound. The job is terminated when the next epoch is going
             to exceed the cost upper bound.
 
@@ -261,7 +261,7 @@ def run(
         beta_knob: `beta_knob * min_eta` is the early stopping cost threshold.
             Set to `np.inf` to disable early stopping.
         eta_knob: $\eta$ used in the cost metric.
-            $cost = \eta * ETA + (1 - \eta) * MaxPower * TTA$
+            $\textrm{cost} = \eta \cdot \textrm{ETA} + (1 - \eta) \cdot \textrm{MaxPower} \cdot \textrm{TTA}$
 
     Returns:
         A list of [`HistoryEntry`][zeus.analyze.HistoryEntry] objects for each job run.
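The reformatted formula is identical in all three docstrings. As a plain-Python sketch of what it computes (the function and parameter names are illustrative, not the library's API; ETA and TTA denote the job's energy- and time-to-accuracy, and MaxPower the maximum GPU power limit):

```python
def zeus_cost(eta_knob: float, energy_to_accuracy: float,
              time_to_accuracy: float, max_power: float) -> float:
    """cost = eta_knob * ETA + (1 - eta_knob) * MaxPower * TTA, both terms in Joules."""
    return eta_knob * energy_to_accuracy + (1 - eta_knob) * max_power * time_to_accuracy

# eta_knob = 1 weighs energy only; eta_knob = 0 weighs (power-scaled) time only.
cost = zeus_cost(0.5, energy_to_accuracy=1.2e6, time_to_accuracy=3.6e3, max_power=300.0)
```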

‎zeus/simulate.py

+3 −3

@@ -97,7 +97,7 @@ def simulate_one_job(
         beta_knob: `beta_knob * min_eta` is the early stopping cost threshold.
             Set to `np.inf` to disable early stopping.
         eta_knob: $\eta$ used in the hybrid cost metric.
-            $cost = \eta * ETA + (1 - \eta) * MaxPower * TTA$
+            $\textrm{cost} = \eta \cdot \textrm{ETA} + (1 - \eta) \cdot \textrm{MaxPower} \cdot \textrm{TTA}$
 
     Returns:
         A list of [`HistoryEntry`][zeus.analyze.HistoryEntry] objects for each job run.
@@ -244,7 +244,7 @@ def simulate_one_alibaba_group(
         beta_knob: `beta_knob * min_eta` is the early stopping cost threshold.
             Set to `np.inf` to disable early stopping.
         eta_knob: $\eta$ used in the hybrid cost metric.
-            $cost = \eta * ETA + (1 - \eta) * MaxPower * TTA$
+            $\textrm{cost} = \eta \cdot \textrm{ETA} + (1 - \eta) \cdot \textrm{MaxPower} \cdot \textrm{TTA}$
 
     Returns:
         A list of [`HistoryEntry`][zeus.analyze.HistoryEntry] objects for each job run.
@@ -681,7 +681,7 @@ def _run_job(
         cost_ub: Cost upper bound. The job is terminated when the next epoch is going
             to exceed the cost upper bound.
         eta_knob: $\eta$ used in the hybrid cost metric.
-            $cost = \eta * ETA + (1 - \eta) * MaxPower * TTA$
+            $\textrm{cost} = \eta \cdot \textrm{ETA} + (1 - \eta) \cdot \textrm{MaxPower} \cdot \textrm{TTA}$
         profile_power: Whether this run of the job should profile power during the
             first epoch.
 

‎zeus/util/metric.py

+5 −5

@@ -48,11 +48,11 @@ class ZeusCostThresholdExceededException(Exception):
     are still alive.
 
     Attributes:
-        time_consumed: Time consumed till the current epoch.
-        energy_consumed: Energy consumed till the current epoch.
-        cost: Computed Zeus's energy-time cost metric till the current epoch.
-        next_cost: Predicted Zeus's energy-time cost metric after next epoch.
-        cost_thresh: The cost threshold.
+        time_consumed (float): Time consumed until the current epoch.
+        energy_consumed (float): Energy consumed until the current epoch.
+        cost (float): Computed Zeus's energy-time cost metric until the current epoch.
+        next_cost (float): Predicted Zeus's energy-time cost metric after next epoch.
+        cost_thresh (float): The cost threshold.
     """
 
     def __init__(

‎zeus_monitor/README.md

+2 −1

@@ -14,9 +14,10 @@ Usage: ./zeus_monitor LOGFILE DURATION SLEEP_MS [GPU_IDX]
 
 ## Building
 
+The Zeus monitor is pre-built for you if you're using our [Docker image](https://ml.energy/zeus/getting_started/environment/).
+
 ### Dependencies
 
-All dependencies are pre-installed if you're using our Docker image.
 1. CMake >= 3.22
 1. CUDAToolkit, especially NVML (`libnvidia-ml.so`)
 
