Integrating Zeus with Huggingface and Capriccio

This example demonstrates how to integrate Zeus with Huggingface and Capriccio, a drifting sentiment analysis dataset.

You can search for # ZEUS in train.py to find the places where a conventional training script needs modification. Parts relevant to using Capriccio are marked with # CAPRICCIO.

Usage

Running Zeus for a single job

While our paper is about optimizing the batch size and power limit over multiple recurrences of the job, it is also possible to use just ZeusDataLoader to JIT-profile and optimize the power limit.
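
For reference, the single-job integration in train.py follows the ZeusDataLoader pattern sketched below. This is only a minimal sketch, not a copy of train.py: the dataset, metric, and hyperparameter names are placeholders, so treat the # ZEUS-marked lines in train.py and the class reference as authoritative.

# Minimal sketch of the ZeusDataLoader pattern (placeholder names; see train.py).
from zeus.run import ZeusDataLoader

train_loader = ZeusDataLoader(train_set, batch_size=128, max_epochs=10)
eval_loader = ZeusDataLoader(eval_set, batch_size=128)

for epoch in train_loader.epochs():      # replaces `for epoch in range(max_epochs)`
    for batch in train_loader:           # early iterations JIT-profile candidate power limits
        ...                              # forward/backward/optimizer step as usual
    for batch in eval_loader:
        ...                              # compute the validation metric
    train_loader.report_metric(val_metric, higher_is_better=True)  # stops once ZEUS_TARGET_METRIC is reached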

Dependencies

  1. Generate Capriccio, following the instructions in Capriccio's README.md.
  2. If you're not using our Docker image, install zeus and build the power monitor, following Installing and Building.
  3. Install python dependencies for this example:
    pip install -r requirements.txt

Example command

ZeusDataLoader interfaces with the outside world via environment variables. Check out the class reference for details.

Only ZEUS_TARGET_METRIC is required; other environment variables below show their default values when omitted.

export ZEUS_TARGET_METRIC="0.84"               # Stop training when the target validation metric is reached
export ZEUS_LOG_DIR="zeus_log"                 # Directory to store profiling logs
export ZEUS_JOB_ID="zeus"                      # Used to distinguish recurrences; not important for a single job
export ZEUS_COST_THRESH="inf"                  # Kill training when the cost (Equation 2) exceeds this
export ZEUS_ETA_KNOB="0.5"                     # Knob to trade off energy and time (Equation 2; see below)
export ZEUS_MONITOR_PATH="/workspace/zeus/zeus_monitor/zeus_monitor" # Path to power monitor
export ZEUS_PROFILE_PARAMS="10,40"             # warmup_iters,profile_iters for each power limit
export ZEUS_USE_OPTIMAL_PL="True"              # Whether to actually use the optimal PL found

python train.py \
    --zeus \
    --data_dir data \
    --slice_number 9 \
    --model_name_or_path bert-base-uncased \
    --batch_size 128
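
ZEUS_COST_THRESH and ZEUS_ETA_KNOB above refer to the cost metric of Equation 2 in the paper, which trades energy consumption off against training time. Roughly, in our own notation (not variable names from train.py):

# Equation 2 (sketch): eta_knob = 1 optimizes purely for energy, eta_knob = 0
# purely for time; the maximum power limit scales the time term into energy units.
cost = eta_knob * energy_to_accuracy + (1 - eta_knob) * max_power_limit * time_to_accuracy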

Running Zeus over multiple recurrences

This example shows how to integrate ZeusDataLoader and drive batch size and power optimizations with ZeusMaster.
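
At a high level, run_zeus.py describes the recurring fine-tuning job to ZeusMaster, which picks a batch size for each recurrence with its batch size optimizer and launches train.py, inside which ZeusDataLoader selects the power limit. The sketch below only conveys that flow; the module paths, class names, and arguments are our assumptions and may not match the actual zeus API, so rely on run_zeus.py and the ZeusMaster class reference for the real invocation.

# Hypothetical sketch of the ZeusMaster flow (names below are assumptions,
# not the verified zeus API; see run_zeus.py for the real code).
from zeus.run import ZeusMaster                       # assumed import path
from zeus.job import Job                              # assumed import path
from zeus.policy import PruningGTSBatchSizeOptimizer  # assumed optimizer class

master = ZeusMaster(
    batch_size_optimizer=PruningGTSBatchSizeOptimizer(),
    log_base="zeus_log",
    monitor_path="/workspace/zeus/zeus_monitor/zeus_monitor",
)
job = Job(dataset="capriccio", target_metric=0.84, max_epochs=10, default_bs=128)

# One recurrence = one fine-tuning job on the next Capriccio slice.
master.run(job=job, num_recurrence=38, batch_sizes=[8, 16, 32, 64, 128],
           eta_knob=0.5, beta_knob=2.0)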

Dependencies

  1. Generate Capriccio, following the instructions in Capriccio's README.md.
  2. If you're not using our Docker image, install zeus and build the power monitor, following Installing and Building.
  3. Install python dependencies for this example:
    pip install -r requirements.txt

Example command

# All arguments shown below are default values.
python run_zeus.py \
    --seed 123 \
    --b_0 128 \
    --lr_0 4.00e-7 \
    --b_min 8 \
    --b_max 128 \
    --num_recurrence 38 \
    --eta_knob 0.5 \
    --beta_knob 2.0 \
    --target_metric 0.84 \
    --max_epochs 10 \
    --window_size 10

Profiling power and time

You can use Zeus's ProfileDataLoader to profile the power and time consumption of training.

Dependencies

  1. Generate Capriccio, following the instructions in Capriccio's README.md.
  2. If you're not using our Docker image, install zeus and build the power monitor, following Installing and Building.
  3. Install python dependencies for this example:
    pip install -r requirements.txt

Example command

ProfileDataLoader interfaces with the outside world via environment variables. Check out its class reference for details.

Only ZEUS_LOG_PREFIX is required; other environment variables below show their default values when omitted.

export ZEUS_LOG_PREFIX="capriccio"              # Filename prefix for power and time log files
export ZEUS_MONITOR_SLEEP_MS="100"              # Milliseconds to sleep after sampling power
export ZEUS_MONITOR_PATH="/workspace/zeus/zeus_monitor/zeus_monitor"  # Path to power monitor

python train.py \
    --profile \
    --data_dir ../../capriccio/data \
    --slice_number 9 \
    --model_name_or_path bert-base-uncased \
    --batch_size 128

A CSV file of the timestamped momentary power draw of the first GPU (index 0) will be written to capriccio+gpu0.power.csv. At the same time, a CSV file with columns for the epoch number, the split (train or eval), and the time consumption in seconds will be written to capriccio.time.csv.
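
If you want a quick energy estimate from the power log, you can integrate the momentary power over time. The snippet below is a sketch under assumptions: the column names ("timestamp" and "power") and the timestamp unit (seconds) are guesses, so check the actual CSV header first.

# Sketch: estimate GPU 0 energy by integrating power over time.
# ASSUMPTIONS: columns are named "timestamp" (seconds) and "power" (watts);
# verify against the header of capriccio+gpu0.power.csv before using.
import numpy as np
import pandas as pd

df = pd.read_csv("capriccio+gpu0.power.csv")
elapsed_s = df["timestamp"] - df["timestamp"].iloc[0]
energy_j = np.trapz(df["power"], x=elapsed_s)  # integral of P dt, in joules
print(f"Estimated GPU 0 energy: {energy_j:,.0f} J")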

Fine-tuning a Huggingface language model on one slice

train.py can also be used to fine-tune a pretrained language model on one slice of Capriccio, without involving Zeus at all.

Dependencies

  1. Generate Capriccio, following the instructions in Capriccio's README.md.
  2. If you're not using our Docker image, install PyTorch separately:
    conda install -c pytorch pytorch==1.10.1
  3. Install python dependencies for this example:
    pip install -r requirements.txt

Example command

python train.py \
    --data_dir data \
    --slice_number 9 \
    --model_name_or_path bert-base-uncased \
    --batch_size 128