Fix FMHA bwd_nl kernel issue for batch_size == 3 #1975

crcrpar · 2026-01-14T00:09:34Z

The bwd_nl (no-loop backward) kernel produces incorrect results for batch_size == 3 because of a num_chunks configuration issue in the CUDA kernel. For batch_size == 3 the default num_chunks = 2 is used, and that configuration is buggy: the max diff is about 0.45, versus the roughly 0.001 expected.

This fix changes the condition for taking the bwd_nl path from `batch_size < 4` to `batch_size <= 2`, so batch_size >= 3 falls back to the regular bwd function, which works correctly.
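The change in dispatch condition can be illustrated with a minimal sketch. The function names `select_backward_path_old` and `select_backward_path_new` are hypothetical and exist only for this illustration; they are not the names used in the actual code, and the real dispatcher calls the kernels rather than returning a string.

```python
def select_backward_path_old(batch_size: int) -> str:
    """Hypothetical model of the condition before the fix (batch_size < 4)."""
    return "bwd_nl" if batch_size < 4 else "bwd"


def select_backward_path_new(batch_size: int) -> str:
    """Hypothetical model of the condition after the fix (batch_size <= 2)."""
    return "bwd_nl" if batch_size <= 2 else "bwd"


# Show which path each batch size takes before and after the change.
for b in (1, 2, 3, 4):
    print(b, select_backward_path_old(b), select_backward_path_new(b))
```

Only batch_size == 3 changes path: it previously took bwd_nl and now takes bwd.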

Test results before fix:

- b=2: bwd_nl works (max diff: 0.001)
- b=3: bwd_nl FAILS (max diff: 0.452)
- b=4: bwd works (max diff: 0.0005)

Test results after fix:

- b=2: bwd_nl works (max diff: 0.001)
- b=3: bwd works (max diff: 0.0005)
- b=4: bwd works (max diff: 0.0005)
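As a rough sketch of how the reported numbers separate passing from failing cases, the snippet below checks each max diff against a tolerance. The tolerance value `5e-3` and the helper `failing_cases` are assumptions made for illustration; the PR does not state the threshold used by the actual test.

```python
# Max diffs as reported above, keyed by (stage, batch_size).
reported = {
    ("before", 2): 0.001,
    ("before", 3): 0.452,
    ("before", 4): 0.0005,
    ("after", 2): 0.001,
    ("after", 3): 0.0005,
    ("after", 4): 0.0005,
}

# Assumed tolerance, not taken from the PR.
TOLERANCE = 5e-3


def failing_cases(results, tol=TOLERANCE):
    """Return the sorted keys whose max diff exceeds the tolerance."""
    return sorted(key for key, diff in results.items() if diff > tol)


print(failing_cases(reported))
```

Under this assumed tolerance, the only case flagged is batch_size == 3 before the fix.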

Used Claude Opus 4.5

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
