
[RTX 4090 24G, Qwen2.5-7B out of VRAM] Can the conversion be done step by step or layer by layer? #41

@yangxianpku

Description


Hardware: RTX 4090 24G

Converting the Qwen2.5-7B-Instruct model on a 4090 GPU fails with an out-of-memory error. It looks like the model weights are loaded in full; could the conversion support step-by-step, or even per-layer, loading so that TransMLA can also run on GPUs with little VRAM?
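For illustration, a minimal sketch of what per-layer conversion could look like, keeping only one decoder layer on the GPU at a time. This is not TransMLA's actual API: `convert_layer` is a hypothetical placeholder for whatever weight transformation the converter applies to each block.

```python
import torch
from transformers import AutoModelForCausalLM

# Load everything on CPU first, so the 24G card never holds the full model.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    dtype=torch.bfloat16,  # the log below warns that torch_dtype is deprecated
)

def convert_layer(layer):
    """Hypothetical stand-in for TransMLA's per-layer conversion."""
    ...

# Stream decoder layers through the GPU one at a time.
for layer in model.model.layers:
    layer.to("cuda")
    convert_layer(layer)
    layer.to("cpu")
    torch.cuda.empty_cache()
```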

The error message is as follows:

```
(base) yangxianpku@ubuntu:~/Repos/TransMLA$ bash scripts/qwen2.5-7B-Instruct.sh

============================================================
Original Model

torch_dtype is deprecated! Use dtype instead!
Loading checkpoint shards: 100%|██████████████████████████████| 4/4 [00:02<00:00, 1.90it/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (299078 > 131072). Running this sequence through the model will result in indexing errors
Evaluating original model's ppl:   0%|          | 0/5 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/yangxianpku/Repos/TransMLA/transmla/converter.py", line 132, in <module>
    main(args)
  File "/home/yangxianpku/Repos/TransMLA/transmla/converter.py", line 68, in main
    dataset_ppl = evaluate_ppl(model, tokenizer.pad_token_id, test_loader, message)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/anacodna3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/yangxianpku/Repos/TransMLA/transmla/utils.py", line 233, in evaluate_ppl
    logits = model(**batch, use_cache=False).logits
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/anacodna3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/anacodna3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/anacodna3/lib/python3.11/site-packages/transformers/utils/generic.py", line 918, in wrapper
    output = func(self, *args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/anacodna3/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 449, in forward
    outputs: BaseModelOutputWithPast = self.model(
                                       ^^^^^^^^^^^
  File "/usr/local/anacodna3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/anacodna3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/anacodna3/lib/python3.11/site-packages/transformers/utils/generic.py", line 1072, in wrapper
    outputs = func(self, *args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/anacodna3/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 384, in forward
    hidden_states = decoder_layer(
                    ^^^^^^^^^^^^^^
  File "/usr/local/anacodna3/lib/python3.11/site-packages/transformers/modeling_layers.py", line 94, in __call__
    return super().__call__(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/anacodna3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/anacodna3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/anacodna3/lib/python3.11/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/anacodna3/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 249, in forward
    hidden_states = self.mlp(hidden_states)
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/anacodna3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/anacodna3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/anacodna3/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 46, in forward
    down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.31 GiB. GPU 0 has a total capacity of 23.65 GiB of which 297.06 MiB is free. Including non-PyTorch memory, this process has 23.35 GiB memory in use. Of the allocated memory 20.57 GiB is allocated by PyTorch, and 2.32 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
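As a possible workaround until per-layer conversion is supported: the OOM happens during the perplexity evaluation, so letting accelerate offload part of the model to CPU, plus enabling the allocator option the error message itself suggests, might already get the script through on 24 GB. A minimal sketch using the standard transformers/accelerate offloading API, not TransMLA's own loading path (the `max_memory` numbers are guesses, not tested values):

```python
import os

# Allocator option suggested by the OOM message; set it before torch is imported.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
from transformers import AutoModelForCausalLM

# Cap GPU weights below the 24 GiB card and spill the rest to CPU RAM;
# accelerate then streams offloaded layers through the GPU during forward.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "20GiB", "cpu": "64GiB"},
)
```

This makes each forward pass slower, since offloaded weights are copied to the GPU on demand, but it keeps peak GPU memory under the cap, which should be enough for the `evaluate_ppl` call that fails above.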
