Conversation

@crcrpar (Collaborator) commented Jan 14, 2026

The `bwd_nl` (no-loop backward) kernel produces incorrect results for `batch_size == 3` due to a `num_chunks` configuration issue in the CUDA kernel. For `batch_size == 3`, the default `num_chunks = 2` is used, and that configuration is buggy: it produces a max diff of 0.45 against the reference, versus the expected ~0.001.

This fix changes the condition for taking the `bwd_nl` path from `batch_size < 4` to `batch_size <= 2`, falling back to the regular `bwd` function for `batch_size >= 3`, which works correctly.

Test results before fix:

- b=2: `bwd_nl` works (max diff: 0.001)
- b=3: `bwd_nl` FAILS (max diff: 0.452)
- b=4: `bwd` works (max diff: 0.0005)

Test results after fix:

- b=2: `bwd_nl` works (max diff: 0.001)
- b=3: `bwd` works (max diff: 0.0005)
- b=4: `bwd` works (max diff: 0.0005)

Used Claude Opus 4.5


Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>