
Conversation


@sssshhhhhh sssshhhhhh commented Jan 7, 2026

Some assertions were wrong, and MockModel's init didn't set some variables, causing the wrong branches to be taken. Also swapped the dims so that D_HEAD == NUM_HEADS no longer holds. Possibly fixes #1968

The failing test is expected, since MQA doesn't support knorm. This branch should also use _merge_time_and_head_dims, because kv_heads can be 1 in cases where the dims shouldn't be merged (tensor parallelism / T5 relative position):

if (_num_heads_kv == 1) { // MQA (Multi-Query Attention)
  if (values_padder)
    values_padder->add_padding(fused_proj);
  ops::Split(2, {_d_head, _d_head})(fused_proj, keys_proj, values_proj);
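
For illustration only (not CTranslate2 code): a minimal, self-contained sketch of roughly what the ops::Split(2, {_d_head, _d_head}) call above does in the MQA branch, splitting a fused KV projection of shape [batch, time, 2 * d_head] into keys and values along the last dimension. The flat-buffer layout and names here are assumptions made for the sketch, ignoring padding and the real storage abstraction.

  // Sketch: split a contiguous [batch, time, 2 * d_head] buffer into
  // keys [batch, time, d_head] and values [batch, time, d_head].
  #include <cstddef>
  #include <iostream>
  #include <vector>

  int main() {
    const std::size_t batch = 2, time = 3, d_head = 4;
    std::vector<float> fused(batch * time * 2 * d_head);
    for (std::size_t i = 0; i < fused.size(); ++i)
      fused[i] = static_cast<float>(i);

    std::vector<float> keys(batch * time * d_head);
    std::vector<float> values(batch * time * d_head);

    for (std::size_t b = 0; b < batch; ++b) {
      for (std::size_t t = 0; t < time; ++t) {
        const float* row = fused.data() + (b * time + t) * 2 * d_head;
        float* k_row = keys.data() + (b * time + t) * d_head;
        float* v_row = values.data() + (b * time + t) * d_head;
        for (std::size_t d = 0; d < d_head; ++d) {
          k_row[d] = row[d];           // first d_head channels -> keys
          v_row[d] = row[d_head + d];  // last d_head channels -> values
        }
      }
    }

    std::cout << "keys[0] = " << keys[0]
              << ", values[0] = " << values[0] << "\n";
    return 0;
  }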


jordimas commented Jan 8, 2026

Thanks. Yes, this is similar to the fix master...jordimas:CTranslate2:cross_attention that I started working on for #1968.

@sssshhhhhh sssshhhhhh (Author) commented

Nice. By the way, layer norm / RMSNorm can be done in place; transformer.cc already does this in multiple places. Also, what do you think about an attention refactor along these lines, merging attention_layer.cc/attention.cc/flash_attention.cc: https://github.com/sssshhhhhh/CTranslate2/blob/rocm/src/layers/rocm_attention.cc
The FA2 path is falling behind (its support wasn't great in the first place) and duplicates a lot of steps. If FA2 were limited to a softmax(QK)V op instead of being a separate layer, I think it would be simpler. To match current FA2 support it would need to handle both the BNTH and BTNH QKV projection layout paths, and head replication / RoPE would move into the dot-product attention op.
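
For illustration only (not the proposed refactor or CTranslate2's API): a naive standalone sketch of a softmax(QK^T / sqrt(d))V op that also does KV head replication inside the op, in the MQA/GQA sense described above. The function name, nested-vector layout, and example shapes are assumptions made for the sketch.

  #include <algorithm>
  #include <cmath>
  #include <cstddef>
  #include <iostream>
  #include <vector>

  // q: [num_heads][t_q][d], k/v: [num_heads_kv][t_k][d]; returns [num_heads][t_q][d].
  // Each query head h reads KV head h / (num_heads / num_heads_kv), so MQA
  // (num_heads_kv == 1) and GQA fall out of the same loop.
  std::vector<std::vector<std::vector<float>>> dot_product_attention(
      const std::vector<std::vector<std::vector<float>>>& q,
      const std::vector<std::vector<std::vector<float>>>& k,
      const std::vector<std::vector<std::vector<float>>>& v) {
    const std::size_t num_heads = q.size();
    const std::size_t num_heads_kv = k.size();
    const std::size_t group = num_heads / num_heads_kv;
    const std::size_t t_q = q[0].size();
    const std::size_t t_k = k[0].size();
    const std::size_t d = q[0][0].size();
    const float scale = 1.f / std::sqrt(static_cast<float>(d));

    std::vector<std::vector<std::vector<float>>> out(
        num_heads,
        std::vector<std::vector<float>>(t_q, std::vector<float>(d, 0.f)));

    for (std::size_t h = 0; h < num_heads; ++h) {
      const std::size_t h_kv = h / group;  // KV head replication inside the op
      for (std::size_t i = 0; i < t_q; ++i) {
        // scores_j = q_i . k_j * scale, then softmax over j
        std::vector<float> scores(t_k);
        float max_score = -1e30f;
        for (std::size_t j = 0; j < t_k; ++j) {
          float s = 0.f;
          for (std::size_t c = 0; c < d; ++c)
            s += q[h][i][c] * k[h_kv][j][c];
          scores[j] = s * scale;
          max_score = std::max(max_score, scores[j]);
        }
        float denom = 0.f;
        for (std::size_t j = 0; j < t_k; ++j) {
          scores[j] = std::exp(scores[j] - max_score);
          denom += scores[j];
        }
        // out_i = sum_j softmax_j * v_j
        for (std::size_t j = 0; j < t_k; ++j)
          for (std::size_t c = 0; c < d; ++c)
            out[h][i][c] += (scores[j] / denom) * v[h_kv][j][c];
      }
    }
    return out;
  }

  int main() {
    // 4 query heads sharing 1 KV head (MQA), 2 timesteps, d = 3.
    std::vector<std::vector<std::vector<float>>> q(
        4, {{1.f, 0.f, 0.f}, {0.f, 1.f, 0.f}});
    std::vector<std::vector<std::vector<float>>> k(
        1, {{1.f, 0.f, 0.f}, {0.f, 1.f, 0.f}});
    std::vector<std::vector<std::vector<float>>> v(
        1, {{1.f, 2.f, 3.f}, {4.f, 5.f, 6.f}});
    auto out = dot_product_attention(q, k, v);
    std::cout << "out[0][0][0] = " << out[0][0][0] << "\n";
    return 0;
  }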

@sssshhhhhh sssshhhhhh closed this Jan 8, 2026
