
H200 full-parameter training of DeepSeek-R1-Distill-Llama-70B: ZeRO-3 (batch_size=1) overflows GPU memory; ZeRO-3 + offload (optimizer and parameters) uses little GPU memory (only 27 GB of 141 GB) but CPU usage is high #7282

github-eliviate opened this issue Mar 13, 2025 · 1 comment


Reminder

  • I have read the above rules and searched the existing issues.

System Info

llama-factory version: 0.8.3.dev0
ubuntu:22.04
python: python3.10
torch: 2.5.1

Reproduction

Put your message here.

Others

No response

github-eliviate added the bug and pending labels on Mar 13, 2025

github-eliviate commented Mar 13, 2025

Full-parameter training of DeepSeek-R1-Distill-Llama-70B on H200 (8 × 141 GB):

  1. With ZeRO-3 (batch_size=1), GPU memory overflows (see the rough memory estimate after this list).
  2. With ZeRO-3 + offload of the optimizer and parameters (batch_size=1), GPU memory usage is low (only 27 GB of the 141 GB per GPU is used) and CPU usage is high.
  3. After increasing batch_size to 64, still with ZeRO-3 + offload of the optimizer and parameters, GPU memory usage is unchanged (only 27 GB of the 141 GB per GPU is used) while CPU usage increases further.
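
For context, here is a rough back-of-the-envelope sketch of the ZeRO-3 model-state footprint (my own estimate, assuming bf16 parameters/gradients plus fp32 Adam states, i.e. 16 bytes per parameter, and ignoring activations and communication buffers):

# Rough estimate of ZeRO-3 model-state memory per GPU for a 70B model on 8 GPUs.
# Assumes bf16 weights/gradients and fp32 Adam states (master weights + momentum
# + variance); activations and temporary buffers are not counted.

params = 70e9   # approximate parameter count of DeepSeek-R1-Distill-Llama-70B
num_gpus = 8    # single H200 node

bytes_weights = 2    # bf16 weights
bytes_grads = 2      # bf16 gradients
bytes_optim = 12     # fp32 master copy + Adam momentum + variance

total_gb = params * (bytes_weights + bytes_grads + bytes_optim) / 1e9

# ZeRO-3 shards weights, gradients and optimizer states across all ranks.
per_gpu_no_offload_gb = total_gb / num_gpus

# With optimizer + parameter offload, the optimizer shards and the persistent
# parameter partitions live in host RAM; roughly only the gradient shard,
# the gathered working parameters and activations stay on the GPU.
per_gpu_offload_gb = params * bytes_grads / num_gpus / 1e9

print(f"total model states:           {total_gb:,.0f} GB")               # ~1120 GB
print(f"per GPU, ZeRO-3, no offload:  {per_gpu_no_offload_gb:,.0f} GB")  # ~140 GB
print(f"per GPU, ZeRO-3 + offload:    {per_gpu_offload_gb:,.0f} GB (plus activations/buffers)")  # ~18 GB

With roughly 1120 GB of model states sharded over 8 ranks, each GPU needs about 140 GB before any activations, which sits right at the 141 GB limit and explains the OOM at batch_size=1. Once optimizer states and parameter partitions are offloaded, only a few tens of GB stay resident per GPU (consistent with the observed 27 GB), and the Adam step runs on the CPU, which is why CPU usage is high and rises with batch_size.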

The config file examples/train_full/llama3_full_sft_ds3.yaml is as follows:

### model
model_name_or_path: /disk2/liweichao/DeepSeek-R1-Distill-Llama-70B

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_offload_config.json

### dataset
dataset: identity
template: llama3
cutoff_len: 1024
# max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/llama3-70B/full/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000


### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500

DeepSpeed parameters (examples/deepspeed/ds_z3_offload_config.json):

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
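
DeepSpeed also ships memory estimators for ZeRO-3 model states that can sanity-check these numbers before launching. A minimal sketch, assuming the estimate_zero3_model_states_mem_needs_all_cold helper is present in the installed DeepSpeed build and using rough (not exact) parameter counts for the 70B model:

# Hedged sketch: prints estimated per-GPU / per-CPU memory for the ZeRO-3
# model states under the various offload options. The parameter counts below
# are rough guesses, not exact values for DeepSeek-R1-Distill-Llama-70B.
from deepspeed.runtime.zero.stage3 import (
    estimate_zero3_model_states_mem_needs_all_cold,
)

estimate_zero3_model_states_mem_needs_all_cold(
    total_params=70e9,          # ~70B parameters
    largest_layer_params=2e9,   # rough guess for the largest single layer (embeddings / lm_head)
    num_gpus_per_node=8,
    num_nodes=1,
)

The printed table should show that with both offload_param and offload_optimizer set to cpu, on the order of a terabyte of optimizer and parameter state moves into host RAM on this node, which matches the low-GPU / high-CPU usage described above.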

Training command:

FORCE_TORCHRUN=1 torchrun --nnodes 1 --node_rank 0 --nproc_per_node 8 --master_addr 127.0.0.1 --master_port 20004 src/train.py  examples/train_full/llama3_full_sft_ds3.yaml
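
To confirm what each rank actually keeps resident while the job above is running, a small watcher can be run in a second terminal on the same node. A sketch using pynvml (the Python bindings to NVML; the script and polling interval are my own choices, not part of LLaMA-Factory):

# Polls per-GPU memory usage every 10 seconds; stop with Ctrl-C.
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

try:
    while True:
        used = [pynvml.nvmlDeviceGetMemoryInfo(h).used / 2**30 for h in handles]
        print(" | ".join(f"GPU{i}: {u:6.1f} GiB" for i, u in enumerate(used)))
        time.sleep(10)
except KeyboardInterrupt:
    pynvml.nvmlShutdown()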
