This example demonstrates how to integrate Zeus with torchvision and the CIFAR100 dataset it provides. It should be straightforward to extend this example to other image classification datasets such as CIFAR10 or ImageNet. Search for `# ZEUS` in `train.py` to find the noteworthy places where a conventional training script needs modification.
## Usages

- Zeus
- Extra
### Zeus: just `ZeusDataLoader`

While our paper is about optimizing the batch size and power limit over multiple recurrences of the job, it is also possible to use just `ZeusDataLoader` to JIT-profile and optimize the power limit.
**Dependencies**

All packages are pre-installed if you're using our Docker image.

1. Install `zeus` and build the power monitor, following Installing and Building.
2. Install `torchvision`:

   ```sh
   conda install -c pytorch torchvision==0.11.2
   ```
`ZeusDataLoader` interfaces with the outside world via environment variables. Check out its class reference for details. Only `ZEUS_TARGET_METRIC` is required; the other environment variables below are optional and are shown with their default values.
```sh
export ZEUS_TARGET_METRIC="0.50"    # Stop training when the target validation metric is reached
export ZEUS_LOG_DIR="zeus_log"      # Directory to store profiling logs
export ZEUS_JOB_ID="zeus"           # Used to distinguish recurrences, so not important
export ZEUS_COST_THRESH="inf"       # Kill training when the cost (Equation 2) exceeds this
export ZEUS_ETA_KNOB="0.5"          # Knob to trade off energy and time (Equation 2)
export ZEUS_MONITOR_PATH="/workspace/zeus/zeus_monitor/zeus_monitor"  # Path to power monitor
export ZEUS_PROFILE_PARAMS="10,40"  # warmup_iters,profile_iters for each power limit
export ZEUS_USE_OPTIMAL_PL="True"   # Whether to actually use the optimal power limit found
```
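As a sketch of how such an environment-variable interface behaves, the self-contained snippet below reads the variables above and applies the documented defaults when they are omitted. The parsing code is illustrative only, not `ZeusDataLoader`'s actual implementation; the variable names and defaults come from the table above.

```python
import os

# Mirror `export ZEUS_TARGET_METRIC="0.50"` from above, so this sketch is self-contained.
os.environ["ZEUS_TARGET_METRIC"] = "0.50"

def load_zeus_config() -> dict:
    """Read Zeus's environment variables, falling back to the documented defaults."""
    env = os.environ.get
    return {
        "target_metric": float(os.environ["ZEUS_TARGET_METRIC"]),  # required; KeyError if unset
        "log_dir": env("ZEUS_LOG_DIR", "zeus_log"),
        "job_id": env("ZEUS_JOB_ID", "zeus"),
        "cost_thresh": float(env("ZEUS_COST_THRESH", "inf")),
        "eta_knob": float(env("ZEUS_ETA_KNOB", "0.5")),
        "use_optimal_pl": env("ZEUS_USE_OPTIMAL_PL", "True") == "True",
    }

config = load_zeus_config()
```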
```sh
python train.py \
    --zeus \
    --arch shufflenetv2 \
    --epochs 100 \
    --batch_size 128
```
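For context on `ZEUS_COST_THRESH` and `ZEUS_ETA_KNOB`: Equation 2 in the paper defines the cost being optimized as `eta * energy + (1 - eta) * max_power * time`. The sketch below shows how a power limit could be picked by minimizing this cost over JIT-profiled measurements. The measurement numbers are made up for illustration, and this is not Zeus's internal code.

```python
def zeus_cost(energy_j: float, time_s: float, eta: float, max_power_w: float) -> float:
    """The cost from Equation 2: eta * energy + (1 - eta) * max_power * time."""
    return eta * energy_j + (1.0 - eta) * max_power_w * time_s

# Hypothetical profiling results: power limit (W) -> (average power W, window time s).
measurements = {
    250: (240.0, 10.0),
    200: (195.0, 11.0),
    150: (148.0, 14.0),
}

eta_knob, max_power = 0.5, 250.0
costs = {
    pl: zeus_cost(avg_power * seconds, seconds, eta_knob, max_power)
    for pl, (avg_power, seconds) in measurements.items()
}
optimal_pl = min(costs, key=costs.get)  # lowest-cost power limit for these made-up numbers
```

A lower power limit saves energy but stretches time, so the best choice depends on `eta_knob`: at `eta_knob = 1.0` only energy matters, while at `0.0` only time (scaled by maximum power) matters.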
### Zeus: `ZeusDataLoader` + `ZeusMaster`

This example shows how to integrate `ZeusDataLoader` and drive batch size and power limit optimization with `ZeusMaster`.
**Dependencies**

All packages are pre-installed if you're using our Docker image.

1. Install `zeus` and build the power monitor, following Installing and Building.
2. Install `torchvision`:

   ```sh
   conda install -c pytorch torchvision==0.11.2
   ```
```sh
# All arguments shown below are default values.
python run_zeus.py \
    --seed 1 \
    --b_0 1024 \
    --b_min 8 \
    --b_max 4096 \
    --num_recurrence 100 \
    --eta_knob 0.5 \
    --beta_knob 2.0 \
    --target_metric 0.50 \
    --max_epochs 100
```
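Our understanding of `--beta_knob` above is that it early-stops an unpromising recurrence once its cost exceeds `beta_knob` times the best cost observed so far; the tiny check below illustrates that rule. Treat the semantics as an assumption and see the `ZeusMaster` class reference for the authoritative behavior; this is not Zeus's actual code.

```python
def should_early_stop(cost_so_far: float, best_cost: float, beta_knob: float = 2.0) -> bool:
    """Kill the current recurrence once its cost exceeds beta_knob * best observed cost."""
    return cost_so_far > beta_knob * best_cost

# With the default beta_knob of 2.0 and a best-so-far cost of 1000.0:
within_budget = should_early_stop(1500.0, 1000.0)  # still within 2x the best -> keep going
over_budget = should_early_stop(2500.0, 1000.0)    # exceeds 2x the best -> stop
```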
### Extra: profiling with `ProfileDataLoader`

You can use Zeus's `ProfileDataLoader` to profile the power and time consumption of training.
**Dependencies**

All packages are pre-installed if you're using our Docker image.

1. Install `zeus` and build the power monitor, following Installing and Building.
2. Install `torchvision`:

   ```sh
   conda install -c pytorch torchvision==0.11.2
   ```
`ProfileDataLoader` interfaces with the outside world via environment variables. Check out its class reference for details. Only `ZEUS_LOG_PREFIX` is required; the other environment variables below are optional and are shown with their default values.
```sh
export ZEUS_LOG_PREFIX="cifar100+shufflenetv2"  # Filename prefix for power and time log files
export ZEUS_MONITOR_SLEEP_MS="100"              # Milliseconds to sleep after sampling power
export ZEUS_MONITOR_PATH="/workspace/zeus/zeus_monitor/zeus_monitor"  # Path to power monitor
```
```sh
python train.py \
    --profile \
    --arch shufflenetv2 \
    --epochs 2 \
    --batch_size 1024
```
A CSV file of the timestamped momentary power draw of the first GPU (index 0) will be written to `cifar100+shufflenetv2+gpu0.power.csv` (the `+gpu0.power.csv` suffix is added by `ProfileDataLoader`). At the same time, a CSV file with columns for the epoch number, the split (`train` or `eval`), and the time consumption in seconds will be written to `cifar100+shufflenetv2.time.csv` (the `.time.csv` suffix is added by `ProfileDataLoader`).
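If you want total energy rather than momentary power, it can be approximated from the power log by integrating power over time. The snippet below assumes the power file parses into `(timestamp_seconds, power_watts)` pairs, which is an assumption about the column layout (check the actual file), and applies the trapezoidal rule:

```python
def energy_joules(samples: list[tuple[float, float]]) -> float:
    """Approximate energy (J) from (timestamp_s, power_w) samples via the trapezoidal rule."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += (p0 + p1) / 2.0 * (t1 - t0)
    return total

# Illustrative samples 100 ms apart, matching the default ZEUS_MONITOR_SLEEP_MS above.
samples = [(0.0, 200.0), (0.1, 210.0), (0.2, 190.0)]
total_energy = energy_joules(samples)
```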
### Extra: plain training

`train.py` can also be used as a simple training script, without involving Zeus at all.
**Dependencies**

All packages are pre-installed if you're using our Docker image.

1. Install `torchvision`:

   ```sh
   conda install -c pytorch torchvision==0.11.2
   ```
```sh
python train.py \
    --arch shufflenetv2 \
    --epochs 100 \
    --batch_size 1024
```