Description
Hardware: RTX 4090, 24 GB
When converting the Qwen2.5-7B-Instruct model on the 4090 GPU, the run fails with an out-of-memory error. The model weights appear to be loaded in full; could stepwise, or even layer-by-layer, loading and conversion be supported, so that TransMLA also works on GPUs with limited VRAM?
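Until such support exists, a possible interim workaround (sketched below, untested; the memory limits are illustrative) is to load the model through accelerate's automatic device mapping, so layers that do not fit in the 24 GB budget spill over to CPU RAM:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" lets accelerate place layers on the GPU until the
# max_memory budget is reached, then offloads the rest to CPU RAM
# (and, failing that, to disk via offload_folder).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16,                     # the log warns torch_dtype is deprecated
    device_map="auto",
    max_memory={0: "20GiB", "cpu": "64GiB"},  # illustrative limits
    offload_folder="offload",
)
```

This keeps the evaluation running at the cost of slower forward passes whenever an offloaded layer has to be paged back onto the GPU.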
The error message is as follows:

```
(base) yangxianpku@ubuntu:~/Repos/TransMLA$ bash scripts/qwen2.5-7B-Instruct.sh
============================================================
Original Model
torch_dtype is deprecated! Use dtype instead!
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00, 1.90it/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (299078 > 131072). Running this sequence through the model will result in indexing errors
Evaluating original model's ppl: 0%| | 0/5 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/yangxianpku/Repos/TransMLA/transmla/converter.py", line 132, in
main(args)
File "/home/yangxianpku/Repos/TransMLA/transmla/converter.py", line 68, in main
dataset_ppl = evaluate_ppl(model, tokenizer.pad_token_id, test_loader, message)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/anacodna3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/yangxianpku/Repos/TransMLA/transmla/utils.py", line 233, in evaluate_ppl
logits = model(**batch, use_cache=False).logits
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/anacodna3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/anacodna3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/anacodna3/lib/python3.11/site-packages/transformers/utils/generic.py", line 918, in wrapper
output = func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/anacodna3/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 449, in forward
outputs: BaseModelOutputWithPast = self.model(
^^^^^^^^^^^
File "/usr/local/anacodna3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/anacodna3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/anacodna3/lib/python3.11/site-packages/transformers/utils/generic.py", line 1072, in wrapper
outputs = func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/anacodna3/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 384, in forward
hidden_states = decoder_layer(
^^^^^^^^^^^^^^
File "/usr/local/anacodna3/lib/python3.11/site-packages/transformers/modeling_layers.py", line 94, in call
return super().call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/anacodna3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/anacodna3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/anacodna3/lib/python3.11/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/anacodna3/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 249, in forward
hidden_states = self.mlp(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/anacodna3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/anacodna3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/anacodna3/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 46, in forward
down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.31 GiB. GPU 0 has a total capacity of 23.65 GiB of which 297.06 MiB is free. Including non-PyTorch memory, this process has 23.35 GiB memory in use. Of the allocated memory 20.57 GiB is allocated by PyTorch, and 2.32 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
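As the error message suggests, setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True may relieve fragmentation, but it cannot fix the underlying capacity shortfall. For the feature request itself, the converter could in principle keep all weights in CPU RAM and stage a single decoder layer on the GPU at a time. A rough sketch of that idea follows; convert_layer is a hypothetical stand-in for TransMLA's per-layer conversion, which I have not inspected:

```python
import torch
from transformers import AutoModelForCausalLM

def convert_layer(layer):
    """Hypothetical stand-in for TransMLA's per-layer MLA conversion."""
    return layer

# Load the full model into CPU RAM only; no GPU memory is touched yet.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    dtype=torch.bfloat16,
    device_map={"": "cpu"},
)

# Convert one decoder layer at a time, so peak GPU usage stays at
# roughly one layer's weights plus conversion workspace.
for i, layer in enumerate(model.model.layers):
    layer.to("cuda:0")
    model.model.layers[i] = convert_layer(layer).to("cpu")
    torch.cuda.empty_cache()
```

With this pattern the GPU only ever holds a single Qwen2 decoder layer (a few hundred MB in bf16 for the 7B model), which should fit comfortably within 24 GB.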