OutOfMemoryError: CUDA out of memory. Tried to allocate 108.00 MiB. GPU 0 has a total capacity of 23.68 GiB of which 34.88 MiB is free. Process 1170126 has 2.25 GiB memory in use. #730

Open
huimoran opened this issue Mar 11, 2025 · 3 comments

@huimoran

(glm-4) ubuntu@c54:~/zch/glm-4/GLM-4-main/finetune_demo$ nvidia-smi
Tue Mar 11 15:31:20 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A5000 Off | 00000000:01:00.0 Off | Off |
| 30% 27C P8 7W / 230W | 1640MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A5000 Off | 00000000:25:00.0 Off | Off |
| 30% 30C P8 6W / 230W | 2318MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA RTX A5000 Off | 00000000:41:00.0 Off | Off |
| 30% 27C P8 9W / 230W | 2318MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA RTX A5000 Off | 00000000:61:00.0 Off | Off |
| 30% 25C P8 9W / 230W | 4099MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA RTX A5000 Off | 00000000:81:00.0 Off | Off |
| 30% 26C P8 4W / 230W | 4463MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA RTX A5000 Off | 00000000:A1:00.0 Off | Off |
| 30% 26C P8 8W / 230W | 2318MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA RTX A5000 Off | 00000000:C1:00.0 Off | Off |
| 30% 24C P8 6W / 230W | 4069MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA RTX A5000 Off | 00000000:E1:00.0 Off | Off |
| 30% 24C P8 8W / 230W | 1640MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
(glm-4) ubuntu@c54:~/zch/glm-4/GLM-4-main/finetune_demo$

Command: (glm-4) ubuntu@c54:~/zch/glm-4/GLM-4-main/finetune_demo$ OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=8 finetune.py data ./glm-4-9b-chat configs/lora.yaml

I have 8 idle GPUs, each with 24 GB of memory. However, when I run the model with the command above, I get the following error. Can anyone help me?

[rank6]: OutOfMemoryError: CUDA out of memory. Tried to allocate 108.00 MiB. GPU 0 has a total capacity of 23.68 GiB of which 34.88 MiB is free. Process 1170126 has 2.25
[rank6]: GiB memory in use. Process 1177435 has 5.44 GiB memory in use. Process 1177429 has 5.44 GiB memory in use. Process 1177431 has 5.83 GiB memory in use. Process
[rank6]: 1177433 has 4.67 GiB memory in use. Of the allocated memory 5.23 GiB is allocated by PyTorch, and 4.81 MiB is reserved by PyTorch but unallocated. If reserved but
[rank6]: unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management
[rank6]: (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank2]:[W311 15:29:11.938241134 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more

lora.yaml:
```yaml
data_config:
  train_file: train.jsonl
  val_file: dev.jsonl
  test_file: test.jsonl
  num_proc: 1

combine: True
freezeV: True
max_input_length: 2048
max_output_length: 2048

training_args:
  # see `transformers.Seq2SeqTrainingArguments`
  output_dir: ./output
  max_steps: 3000
  # needed to be fit for the dataset
  learning_rate: 5e-4
  # settings for data loading
  per_device_train_batch_size: 1
  dataloader_num_workers: 16
  remove_unused_columns: false
  # settings for saving checkpoints
  save_strategy: steps
  save_steps: 500
  # settings for logging
  log_level: info
  logging_strategy: steps
  logging_steps: 10
  # settings for evaluation
  per_device_eval_batch_size: 4
  eval_strategy: steps
  eval_steps: 500
  # settings for optimizer
  adam_epsilon: 1e-6
  # uncomment the following line to detect nan or inf values
  # debug: underflow_overflow
  predict_with_generate: true
  # see `transformers.GenerationConfig`
  generation_config:
    max_new_tokens: 512
  # set your absolute deepspeed path here
  deepspeed: configs/ds_zero_3.json

peft_config:
  peft_type: LORA
  task_type: CAUSAL_LM
  r: 8
  lora_alpha: 32
  lora_dropout: 0.1
  target_modules: ["query_key_value"]
  # target_modules: ["q_proj", "k_proj", "v_proj"] if model is glm-4-9b-chat-hf
```
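
For reference, the allocator hint printed in the traceback can be tried directly on the same launch command. This is only a minimal sketch: `expandable_segments` mitigates fragmentation and cannot reclaim memory that other processes already hold.

```bash
# Same launch command as above, with the allocator hint from the traceback.
# expandable_segments:True only reduces fragmentation; it does not free
# memory that other processes already hold on the GPUs.
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True OMP_NUM_THREADS=1 \
  torchrun --standalone --nnodes=1 --nproc_per_node=8 \
  finetune.py data ./glm-4-9b-chat configs/lora.yaml
```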

@huimoran
Author

I added the deepspeed line, but it still errors out.
```yaml
data_config:
  train_file: train.jsonl
  val_file: dev.jsonl
  test_file: test.jsonl
  num_proc: 1

combine: True
freezeV: True
max_input_length: 2048
max_output_length: 2048

training_args:
  # see `transformers.Seq2SeqTrainingArguments`
  output_dir: ./output
  max_steps: 3000
  # needed to be fit for the dataset
  learning_rate: 5e-4
  # settings for data loading
  per_device_train_batch_size: 1
  dataloader_num_workers: 16
  remove_unused_columns: false
  # settings for saving checkpoints
  save_strategy: steps
  save_steps: 500
  # settings for logging
  log_level: info
  logging_strategy: steps
  logging_steps: 10
  # settings for evaluation
  per_device_eval_batch_size: 4
  eval_strategy: steps
  eval_steps: 500
  # settings for optimizer
  adam_epsilon: 1e-6
  # uncomment the following line to detect nan or inf values
  # debug: underflow_overflow
  predict_with_generate: true
  # see `transformers.GenerationConfig`
  generation_config:
    max_new_tokens: 512
  # set your absolute deepspeed path here
  deepspeed: configs/ds_zero_3.json

peft_config:
  peft_type: LORA
  task_type: CAUSAL_LM
  r: 8
  lora_alpha: 32
  lora_dropout: 0.1
  target_modules: ["query_key_value"]
  # target_modules: ["q_proj", "k_proj", "v_proj"] if model is glm-4-9b-chat-hf
```
Command: OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=8 finetune.py data ./glm-4-9b-chat configs/lora.yaml

Error:
[rank4]: OutOfMemoryError: CUDA out of memory. Tried to allocate 1.26 GiB. GPU 4 has a total capacity of 23.68 GiB of which 1.03 GiB is free. Process 1405495 has 2.09 GiB memory in
[rank4]: use. Process 1170126 has 2.25 GiB memory in use. Process 1194375 has 18.29 GiB memory in use. Of the allocated memory 17.90 GiB is allocated by PyTorch, and 3.37 MiB is
[rank4]: reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See
[rank4]: documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
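
The traceback lists several other PIDs (1405495, 1170126, 1194375) already holding memory on GPU 4, so it may help to check what is resident on each card before launching. A quick sketch using nvidia-smi's query flags:

```bash
# Show per-GPU totals and free memory.
nvidia-smi --query-gpu=index,memory.total,memory.used,memory.free --format=csv

# List every compute process and how much memory it holds, to spot
# leftover processes from earlier (possibly crashed) runs.
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```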

@sixsixcoder
Collaborator

If the configuration itself is fine, try reducing these parameters and run again; they can drive GPU memory usage up:

max_input_length: 2048
max_output_length: 2048
max_steps: 3000

@huimoran
Author

lora.yaml file:
```yaml
data_config:
  train_file: train.jsonl
  val_file: dev.jsonl
  test_file: test.jsonl
  num_proc: 1

combine: True
freezeV: True
max_input_length: 512
max_output_length: 512

training_args:
  # see `transformers.Seq2SeqTrainingArguments`
  output_dir: ./output
  max_steps: 500
  # needed to be fit for the dataset
  learning_rate: 5e-4
  # settings for data loading
  per_device_train_batch_size: 1
  dataloader_num_workers: 16
  remove_unused_columns: false
  # settings for saving checkpoints
  save_strategy: steps
  save_steps: 500
  # settings for logging
  log_level: info
  logging_strategy: steps
  logging_steps: 10
  # settings for evaluation
  per_device_eval_batch_size: 4
  eval_strategy: steps
  eval_steps: 500
  # settings for optimizer
  adam_epsilon: 1e-6
  # uncomment the following line to detect nan or inf values
  # debug: underflow_overflow
  predict_with_generate: true
  # see `transformers.GenerationConfig`
  generation_config:
    max_new_tokens: 512
  # set your absolute deepspeed path here
  deepspeed: ds_zero_3.json

peft_config:
  peft_type: LORA
  task_type: CAUSAL_LM
  r: 8
  lora_alpha: 32
  lora_dropout: 0.1
  target_modules: ["query_key_value"]
  # target_modules: ["q_proj", "k_proj", "v_proj"] if model is glm-4-9b-chat-hf
```

Command:
OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=8 finetune.py data ./glm-4-9b-chat configs/lora.yaml

I reduced those three parameters, but I still get the same error:
[rank7]: OutOfMemoryError: CUDA out of memory. Tried to allocate 214.00 MiB. GPU 7 has a total capacity of 23.68 GiB of which 86.00 MiB is
[rank7]: free. Process 1343547 has 6.18 GiB memory in use. Process 1462076 has 3.78 GiB memory in use. Process 1607439 has 13.62 GiB memory
[rank7]: in use. Of the allocated memory 13.42 GiB is allocated by PyTorch, and 438.50 KiB is reserved by PyTorch but unallocated. If
[rank7]: reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See
[rank7]: documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
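
Note that this last traceback again shows other PIDs (1343547, 1462076) holding several GiB on GPU 7 before training allocates anything, so each rank is not starting from an empty 24 GB card. If those processes cannot be stopped, one hedged workaround is to launch only on GPUs that are actually free; the device indices below are placeholders, and this assumes finetune.py simply uses whatever devices CUDA exposes to each local rank.

```bash
# Hypothetical: expose only the GPUs that nvidia-smi reports as free
# (0 and 7 are placeholders) and start one worker per visible device.
CUDA_VISIBLE_DEVICES=0,7 OMP_NUM_THREADS=1 \
  torchrun --standalone --nnodes=1 --nproc_per_node=2 \
  finetune.py data ./glm-4-9b-chat configs/lora.yaml
```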
