
H200 full-parameter training of DeepSeek-R1-Distill-Llama-70B: ZeRO-3 (batch_size=1) overflows GPU memory; ZeRO-3 + offload (optimizer and parameters) uses little GPU memory (only 27 GB of 141 GB) but CPU usage is high #7282

github-eliviate opened this issue Mar 13, 2025 · 1 comment


Reminder

  • I have read the above rules and searched the existing issues.

System Info

llama-factory version: 0.8.3.dev0
ubuntu:22.04
python: python3.10
torch: 2.5.1

Reproduction

Put your message here.

Others

No response

github-eliviate added the bug and pending labels on Mar 13, 2025

github-eliviate commented Mar 13, 2025

Full-parameter training of DeepSeek-R1-Distill-Llama-70B on H200 (8 × 141 GB):

  1. With ZeRO-3 (batch_size=1), GPU memory overflows (see the rough memory estimate after this list).
  2. With ZeRO-3 + offload of the optimizer and parameters (batch_size=1), GPU memory usage is low (only 27 GB of the 141 GB per GPU is used) and CPU usage is high.
  3. After increasing batch_size to 64, still with ZeRO-3 + offload of the optimizer and parameters, GPU memory usage is unchanged (only 27 GB of the 141 GB per GPU is used) while CPU usage increases further.
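
For context, here is a rough back-of-the-envelope sketch of the ZeRO-3 model-state footprint (my own estimate, assuming bf16 parameters/gradients plus fp32 Adam states, i.e. 16 bytes per parameter, and ignoring activations and communication buffers):

# Rough estimate of ZeRO-3 model-state memory per GPU for a 70B model on 8 GPUs.
# Assumes bf16 weights/gradients and fp32 Adam states (master weights + momentum
# + variance); activations and temporary buffers are not counted.

params = 70e9   # approximate parameter count of DeepSeek-R1-Distill-Llama-70B
num_gpus = 8    # single H200 node

bytes_weights = 2    # bf16 weights
bytes_grads = 2      # bf16 gradients
bytes_optim = 12     # fp32 master copy + Adam momentum + variance

total_gb = params * (bytes_weights + bytes_grads + bytes_optim) / 1e9

# ZeRO-3 shards weights, gradients and optimizer states across all ranks.
per_gpu_no_offload_gb = total_gb / num_gpus

# With optimizer + parameter offload, the optimizer shards and the persistent
# parameter partitions live in host RAM; roughly only the gradient shard,
# the gathered working parameters and activations stay on the GPU.
per_gpu_offload_gb = params * bytes_grads / num_gpus / 1e9

print(f"total model states:           {total_gb:,.0f} GB")               # ~1120 GB
print(f"per GPU, ZeRO-3, no offload:  {per_gpu_no_offload_gb:,.0f} GB")  # ~140 GB
print(f"per GPU, ZeRO-3 + offload:    {per_gpu_offload_gb:,.0f} GB (plus activations/buffers)")  # ~18 GB

With roughly 1120 GB of model states sharded over 8 ranks, each GPU needs about 140 GB before any activations, which sits right at the 141 GB limit and explains the OOM at batch_size=1. Once optimizer states and parameter partitions are offloaded, only a few tens of GB stay resident per GPU (consistent with the observed 27 GB), and the Adam step runs on the CPU, which is why CPU usage is high and rises with batch_size.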

The config file examples/train_full/llama3_full_sft_ds3.yaml is as follows:

### model
model_name_or_path: /disk2/liweichao/DeepSeek-R1-Distill-Llama-70B

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_offload_config.json

### dataset
dataset: identity
template: llama3
cutoff_len: 1024
# max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/llama3-70B/full/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000


### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500

DeepSpeed parameters (examples/deepspeed/ds_z3_offload_config.json):

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
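
DeepSpeed also ships memory estimators for ZeRO-3 model states that can sanity-check these numbers before launching. A minimal sketch, assuming the estimate_zero3_model_states_mem_needs_all_cold helper is present in the installed DeepSpeed build and using rough (not exact) parameter counts for the 70B model:

# Hedged sketch: prints estimated per-GPU / per-CPU memory for the ZeRO-3
# model states under the various offload options. The parameter counts below
# are rough guesses, not exact values for DeepSeek-R1-Distill-Llama-70B.
from deepspeed.runtime.zero.stage3 import (
    estimate_zero3_model_states_mem_needs_all_cold,
)

estimate_zero3_model_states_mem_needs_all_cold(
    total_params=70e9,          # ~70B parameters
    largest_layer_params=2e9,   # rough guess for the largest single layer (embeddings / lm_head)
    num_gpus_per_node=8,
    num_nodes=1,
)

The printed table should show that with both offload_param and offload_optimizer set to cpu, on the order of a terabyte of optimizer and parameter state moves into host RAM on this node, which matches the low-GPU / high-CPU usage described above.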

Training command:

FORCE_TORCHRUN=1 torchrun --nnodes 1 --node_rank 0 --nproc_per_node 8 --master_addr 127.0.0.1 --master_port 20004 src/train.py  examples/train_full/llama3_full_sft_ds3.yaml
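
To confirm what each rank actually keeps resident while the job above is running, a small watcher can be run in a second terminal on the same node. A sketch using pynvml (the Python bindings to NVML; the script and polling interval are my own choices, not part of LLaMA-Factory):

# Polls per-GPU memory usage every 10 seconds; stop with Ctrl-C.
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

try:
    while True:
        used = [pynvml.nvmlDeviceGetMemoryInfo(h).used / 2**30 for h in handles]
        print(" | ".join(f"GPU{i}: {u:6.1f} GiB" for i, u in enumerate(used)))
        time.sleep(10)
except KeyboardInterrupt:
    pynvml.nvmlShutdown()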
