OutOfMemoryError: CUDA out of memory. Tried to allocate 108.00 MiB. GPU 0 has a total capacity of 23.68 GiB of which 34.88 MiB is free. Process 1170126 has 2.25 GiB memory in use. #730

Open
huimoran opened this issue Mar 11, 2025 · 3 comments

@huimoran

(glm-4) ubuntu@c54:~/zch/glm-4/GLM-4-main/finetune_demo$ nvidia-smi
Tue Mar 11 15:31:20 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A5000 Off | 00000000:01:00.0 Off | Off |
| 30% 27C P8 7W / 230W | 1640MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A5000 Off | 00000000:25:00.0 Off | Off |
| 30% 30C P8 6W / 230W | 2318MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA RTX A5000 Off | 00000000:41:00.0 Off | Off |
| 30% 27C P8 9W / 230W | 2318MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA RTX A5000 Off | 00000000:61:00.0 Off | Off |
| 30% 25C P8 9W / 230W | 4099MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA RTX A5000 Off | 00000000:81:00.0 Off | Off |
| 30% 26C P8 4W / 230W | 4463MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA RTX A5000 Off | 00000000:A1:00.0 Off | Off |
| 30% 26C P8 8W / 230W | 2318MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA RTX A5000 Off | 00000000:C1:00.0 Off | Off |
| 30% 24C P8 6W / 230W | 4069MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA RTX A5000 Off | 00000000:E1:00.0 Off | Off |
| 30% 24C P8 8W / 230W | 1640MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
(glm-4) ubuntu@c54:~/zch/glm-4/GLM-4-main/finetune_demo$

Command: (glm-4) ubuntu@c54:~/zch/glm-4/GLM-4-main/finetune_demo$ OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=8 finetune.py data ./glm-4-9b-chat configs/lora.yaml

I have 8 idle GPUs, each with 24 GB of memory. However, when I run the model with the command above, I get the following error. Can anyone help me?

[rank6]: OutOfMemoryError: CUDA out of memory. Tried to allocate 108.00 MiB. GPU 0 has a total capacity of 23.68 GiB of which 34.88 MiB is free. Process 1170126 has 2.25
[rank6]: GiB memory in use. Process 1177435 has 5.44 GiB memory in use. Process 1177429 has 5.44 GiB memory in use. Process 1177431 has 5.83 GiB memory in use. Process
[rank6]: 1177433 has 4.67 GiB memory in use. Of the allocated memory 5.23 GiB is allocated by PyTorch, and 4.81 MiB is reserved by PyTorch but unallocated. If reserved but
[rank6]: unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management
[rank6]: (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank2]:[W311 15:29:11.938241134 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more

lora.yaml:
```yaml
data_config:
  train_file: train.jsonl
  val_file: dev.jsonl
  test_file: test.jsonl
  num_proc: 1

combine: True
freezeV: True
max_input_length: 2048
max_output_length: 2048

training_args:
  # see `transformers.Seq2SeqTrainingArguments`
  output_dir: ./output
  max_steps: 3000
  # needed to be fit for the dataset
  learning_rate: 5e-4
  # settings for data loading
  per_device_train_batch_size: 1
  dataloader_num_workers: 16
  remove_unused_columns: false
  # settings for saving checkpoints
  save_strategy: steps
  save_steps: 500
  # settings for logging
  log_level: info
  logging_strategy: steps
  logging_steps: 10
  # settings for evaluation
  per_device_eval_batch_size: 4
  eval_strategy: steps
  eval_steps: 500
  # settings for optimizer
  adam_epsilon: 1e-6
  # uncomment the following line to detect nan or inf values
  # debug: underflow_overflow
  predict_with_generate: true
  # see `transformers.GenerationConfig`
  generation_config:
    max_new_tokens: 512
  # set your absolute deepspeed path here
  deepspeed: configs/ds_zero_3.json

peft_config:
  peft_type: LORA
  task_type: CAUSAL_LM
  r: 8
  lora_alpha: 32
  lora_dropout: 0.1
  target_modules: ["query_key_value"]
  # target_modules: ["q_proj", "k_proj", "v_proj"] if model is glm-4-9b-chat-hf
```
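
For reference, the allocator hint printed in the traceback can be tried directly on the same launch command. This is only a minimal sketch: `expandable_segments` mitigates fragmentation and cannot reclaim memory that other processes already hold.

```bash
# Same launch command as above, with the allocator hint from the traceback.
# expandable_segments:True only reduces fragmentation; it does not free
# memory that other processes already hold on the GPUs.
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True OMP_NUM_THREADS=1 \
  torchrun --standalone --nnodes=1 --nproc_per_node=8 \
  finetune.py data ./glm-4-9b-chat configs/lora.yaml
```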

@huimoran
Author

I added the deepspeed line, but it still errors out.
```yaml
data_config:
  train_file: train.jsonl
  val_file: dev.jsonl
  test_file: test.jsonl
  num_proc: 1

combine: True
freezeV: True
max_input_length: 2048
max_output_length: 2048

training_args:
  # see `transformers.Seq2SeqTrainingArguments`
  output_dir: ./output
  max_steps: 3000
  # needed to be fit for the dataset
  learning_rate: 5e-4
  # settings for data loading
  per_device_train_batch_size: 1
  dataloader_num_workers: 16
  remove_unused_columns: false
  # settings for saving checkpoints
  save_strategy: steps
  save_steps: 500
  # settings for logging
  log_level: info
  logging_strategy: steps
  logging_steps: 10
  # settings for evaluation
  per_device_eval_batch_size: 4
  eval_strategy: steps
  eval_steps: 500
  # settings for optimizer
  adam_epsilon: 1e-6
  # uncomment the following line to detect nan or inf values
  # debug: underflow_overflow
  predict_with_generate: true
  # see `transformers.GenerationConfig`
  generation_config:
    max_new_tokens: 512
  # set your absolute deepspeed path here
  deepspeed: configs/ds_zero_3.json

peft_config:
  peft_type: LORA
  task_type: CAUSAL_LM
  r: 8
  lora_alpha: 32
  lora_dropout: 0.1
  target_modules: ["query_key_value"]
  # target_modules: ["q_proj", "k_proj", "v_proj"] if model is glm-4-9b-chat-hf
```
Command: OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=8 finetune.py data ./glm-4-9b-chat configs/lora.yaml

Error:
[rank4]: OutOfMemoryError: CUDA out of memory. Tried to allocate 1.26 GiB. GPU 4 has a total capacity of 23.68 GiB of which 1.03 GiB is free. Process 1405495 has 2.09 GiB memory in
[rank4]: use. Process 1170126 has 2.25 GiB memory in use. Process 1194375 has 18.29 GiB memory in use. Of the allocated memory 17.90 GiB is allocated by PyTorch, and 3.37 MiB is
[rank4]: reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See
[rank4]: documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
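
The traceback lists several other PIDs (1405495, 1170126, 1194375) already holding memory on GPU 4, so it may help to check what is resident on each card before launching. A quick sketch using nvidia-smi's query flags:

```bash
# Show per-GPU totals and free memory.
nvidia-smi --query-gpu=index,memory.total,memory.used,memory.free --format=csv

# List every compute process and how much memory it holds, to spot
# leftover processes from earlier (possibly crashed) runs.
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```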

@sixsixcoder
Collaborator

If the configuration itself is fine, try reducing these parameters and run again; they can drive GPU memory usage up:

max_input_length: 2048
max_output_length: 2048
max_steps: 3000

@huimoran
Author

lora.yaml file:
```yaml
data_config:
  train_file: train.jsonl
  val_file: dev.jsonl
  test_file: test.jsonl
  num_proc: 1

combine: True
freezeV: True
max_input_length: 512
max_output_length: 512

training_args:
  # see `transformers.Seq2SeqTrainingArguments`
  output_dir: ./output
  max_steps: 500
  # needed to be fit for the dataset
  learning_rate: 5e-4
  # settings for data loading
  per_device_train_batch_size: 1
  dataloader_num_workers: 16
  remove_unused_columns: false
  # settings for saving checkpoints
  save_strategy: steps
  save_steps: 500
  # settings for logging
  log_level: info
  logging_strategy: steps
  logging_steps: 10
  # settings for evaluation
  per_device_eval_batch_size: 4
  eval_strategy: steps
  eval_steps: 500
  # settings for optimizer
  adam_epsilon: 1e-6
  # uncomment the following line to detect nan or inf values
  # debug: underflow_overflow
  predict_with_generate: true
  # see `transformers.GenerationConfig`
  generation_config:
    max_new_tokens: 512
  # set your absolute deepspeed path here
  deepspeed: ds_zero_3.json

peft_config:
  peft_type: LORA
  task_type: CAUSAL_LM
  r: 8
  lora_alpha: 32
  lora_dropout: 0.1
  target_modules: ["query_key_value"]
  # target_modules: ["q_proj", "k_proj", "v_proj"] if model is glm-4-9b-chat-hf
```

Command:
OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=8 finetune.py data ./glm-4-9b-chat configs/lora.yaml

I reduced those three parameters, but I still get the same error:
[rank7]: OutOfMemoryError: CUDA out of memory. Tried to allocate 214.00 MiB. GPU 7 has a total capacity of 23.68 GiB of which 86.00 MiB is
[rank7]: free. Process 1343547 has 6.18 GiB memory in use. Process 1462076 has 3.78 GiB memory in use. Process 1607439 has 13.62 GiB memory
[rank7]: in use. Of the allocated memory 13.42 GiB is allocated by PyTorch, and 438.50 KiB is reserved by PyTorch but unallocated. If
[rank7]: reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See
[rank7]: documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
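
Note that this last traceback again shows other PIDs (1343547, 1462076) holding several GiB on GPU 7 before training allocates anything, so each rank is not starting from an empty 24 GB card. If those processes cannot be stopped, one hedged workaround is to launch only on GPUs that are actually free; the device indices below are placeholders, and this assumes finetune.py simply uses whatever devices CUDA exposes to each local rank.

```bash
# Hypothetical: expose only the GPUs that nvidia-smi reports as free
# (0 and 7 are placeholders) and start one worker per visible device.
CUDA_VISIBLE_DEVICES=0,7 OMP_NUM_THREADS=1 \
  torchrun --standalone --nnodes=1 --nproc_per_node=2 \
  finetune.py data ./glm-4-9b-chat configs/lora.yaml
```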
