This example will demonstrate how to integrate Zeus with Capriccio, a drifting sentiment analysis dataset.
You can search for `# ZEUS` in `train.py` for noteworthy places that require modification from conventional training scripts. Parts relevant to using Capriccio are also marked with `# CAPRICCIO`.
**Usages**

- Zeus
- Extra
While our paper is about optimizing the batch size and power limit over multiple recurrences of the job, it is also possible to use just `ZeusDataLoader` to JIT-profile and optimize the power limit.
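In code, the `# ZEUS`-marked integration follows roughly the pattern below. This is a sketch of the documented `ZeusDataLoader` usage, not the exact contents of `train.py`; the import path and method names (`epochs()`, `report_metric()`) are assumptions here, so check the class reference for the authoritative signatures.

```python
# Sketch of a ZeusDataLoader-based training loop. `train_set`, `eval_set`, and
# `val_metric` are placeholders; the import path and method names are assumed.
from zeus.run import ZeusDataLoader

# ZEUS
# The loader constructed with max_epochs acts as the train dataloader and
# drives JIT profiling and selection of the GPU power limit.
train_loader = ZeusDataLoader(train_set, batch_size=128, max_epochs=10)
eval_loader = ZeusDataLoader(eval_set, batch_size=128)

for epoch in train_loader.epochs():
    for batch in train_loader:
        ...  # forward/backward pass and optimizer step
    for batch in eval_loader:
        ...  # compute the validation metric on this slice
    # Report the validation metric so training can stop at ZEUS_TARGET_METRIC.
    train_loader.report_metric(val_metric)
```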
- Generate Capriccio, following the instructions in Capriccio's README.md.
- If you're not using our Docker image, install `zeus` and build the power monitor, following Installing and Building.
- Install python dependencies for this example:
    ```sh
    pip install -r requirements.txt
    ```
`ZeusDataLoader` interfaces with the outside world via environment variables. Check out the class reference for details.

Only `ZEUS_TARGET_METRIC` is required; other environment variables below show their default values when omitted.
```sh
export ZEUS_TARGET_METRIC="0.84"    # Stop training when target val metric is reached
export ZEUS_LOG_DIR="zeus_log"      # Directory to store profiling logs
export ZEUS_JOB_ID="zeus"           # Used to distinguish recurrences, so not important
export ZEUS_COST_THRESH="inf"       # Kill training when cost (Equation 2) exceeds this
export ZEUS_ETA_KNOB="0.5"          # Knob to trade off energy and time (Equation 2)
export ZEUS_MONITOR_PATH="/workspace/zeus/zeus_monitor/zeus_monitor"  # Path to power monitor
export ZEUS_PROFILE_PARAMS="10,40"  # warmup_iters,profile_iters for each power limit
export ZEUS_USE_OPTIMAL_PL="True"   # Whether to actually use the optimal power limit found
```
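For context, the cost referenced by `ZEUS_COST_THRESH` and `ZEUS_ETA_KNOB` is Equation 2 of the Zeus paper, which (paraphrased from the paper; see the paper for the exact formulation) trades off energy-to-accuracy (ETA) and time-to-accuracy (TTA) as

$$
\text{Cost} = \eta \cdot \text{ETA} + (1 - \eta) \cdot \text{MaxPower} \cdot \text{TTA},
$$

where $\eta$ is `ZEUS_ETA_KNOB` and MaxPower is the maximum GPU power limit.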
```sh
python train.py \
    --zeus \
    --data_dir data \
    --slice_number 9 \
    --model_name_or_path bert-base-uncased \
    --batch_size 128
```
This example shows how to integrate `ZeusDataLoader` and drive batch size and power optimizations with `ZeusMaster`.
- Generate Capriccio, following the instructions in Capriccio's README.md.
- If you're not using our Docker image, install `zeus` and build the power monitor, following Installing and Building.
- Install python dependencies for this example:
    ```sh
    pip install -r requirements.txt
    ```
```sh
# All arguments shown below are default values.
python run_zeus.py \
    --seed 123 \
    --b_0 128 \
    --lr_0 4.00e-7 \
    --b_min 8 \
    --b_max 128 \
    --num_recurrence 38 \
    --eta_knob 0.5 \
    --beta_knob 2.0 \
    --target_metric 0.84 \
    --max_epochs 10 \
    --window_size 10
```
You can use Zeus's `ProfileDataLoader` to profile the power and time consumption of training.
- Generate Capriccio, following the instructions in Capriccio's README.md.
- If you're not using our Docker image, install `zeus` and build the power monitor, following Installing and Building.
- Install python dependencies for this example:
    ```sh
    pip install -r requirements.txt
    ```
`ProfileDataLoader` interfaces with the outside world via environment variables. Check out its class reference for details.

Only `ZEUS_LOG_PREFIX` is required; other environment variables below show their default values when omitted.
```sh
export ZEUS_LOG_PREFIX="capriccio"  # Filename prefix for power and time log files
export ZEUS_MONITOR_SLEEP_MS="100"  # Milliseconds to sleep after sampling power
export ZEUS_MONITOR_PATH="/workspace/zeus/zeus_monitor/zeus_monitor"  # Path to power monitor
```
```sh
python train.py \
    --profile \
    --data_dir ../../capriccio/data \
    --slice_number 9 \
    --model_name_or_path bert-base-uncased \
    --batch_size 128
```
A CSV file of timestamped momentary power draw of the first GPU (index 0) will be written to `capriccio+gpu0.power.csv`. At the same time, a CSV file recording the epoch number, split (`train` or `eval`), and time consumption in seconds will be written to `capriccio.time.csv`.
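To sanity-check these logs, something like the sketch below works, assuming `pandas` is installed. The filenames follow `ZEUS_LOG_PREFIX="capriccio"`; exact column names are not assumed, so the snippet just prints what is there.

```python
# Peek at the profiling output written by ProfileDataLoader.
import pandas as pd

power = pd.read_csv("capriccio+gpu0.power.csv")  # timestamped power draw of GPU 0
times = pd.read_csv("capriccio.time.csv")        # per-epoch train/eval time in seconds

print(power.head())  # inspect the power samples and their column names
print(times.head())  # inspect epoch number, split, and time columns
print(f"{len(power)} power samples, {len(times)} time records")
```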
`train.py` can also be used to fine-tune a pretrained language model on one slice of Capriccio, without using Zeus at all.
- Generate Capriccio, following the instructions in Capriccio's README.md.
- If you're not using our Docker image, install PyTorch separately:
    ```sh
    conda install -c pytorch pytorch==1.10.1
    ```
- Install python dependencies for this example:
    ```sh
    pip install -r requirements.txt
    ```
```sh
python train.py \
    --data_dir data \
    --slice_number 9 \
    --model_name_or_path bert-base-uncased \
    --batch_size 128
```