Fix issue with recurrent part of chunk-wise backward computations #10

hoedt · 2025-06-10T09:18:33Z

When chunk_size_inter is less than chunk_size_intra (e.g. for chunk_size values greater than 128, the default chunk-size value in the heuristics), the recurrent state gradients (and therefore also the gradients w.r.t. K and V) were computed incorrectly.

It turns out that the backward kernel assumed that all recurrent states would be available. However, in the forward pass, only the states that are necessary for the parallel part (which are fewer if chunk_size_intra is greater than chunk_size_inter) are stored. This fix simply uses the available states correctly to compute the recurrent gradients.
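
To illustrate the kind of mismatch described above, here is a minimal sketch using a scalar toy recurrence. It is not the actual kernel code: `run_recurrence`, `stored_index`, and the parameter `save_every_nth` are hypothetical names chosen for illustration, and the real kernels operate on matrix-valued states. The sketch only shows that a state array saved at a coarser stride has to be indexed with that stride in mind.

```python
import numpy as np


def run_recurrence(updates, decays, save_every_nth=1):
    """Toy scalar recurrence c_k = decays[k-1] * c_{k-1} + updates[k-1].

    Keeps the initial state and then only every `save_every_nth`-th state,
    mimicking a forward pass that stores fewer states than it computes.
    (Hypothetical helper, not part of the actual kernels.)
    """
    state = 0.0
    saved = [state]  # the initial state is always kept
    for k, (u, d) in enumerate(zip(updates, decays), start=1):
        state = d * state + u
        if k % save_every_nth == 0:
            saved.append(state)
    return np.array(saved)


def stored_index(chunk_idx, save_every_nth):
    """Map a fine-grained chunk boundary to its position in the saved array.

    Returns None when the state at that boundary was not stored.
    """
    if chunk_idx % save_every_nth != 0:
        return None
    return chunk_idx // save_every_nth


updates = np.arange(1.0, 9.0)  # 8 fine-grained chunks
decays = np.full(8, 0.5)

all_states = run_recurrence(updates, decays, save_every_nth=1)  # 9 states
saved_states = run_recurrence(updates, decays, save_every_nth=2)  # 5 states

# Indexing the saved array as if every state were present picks the wrong state.
wrong = saved_states[2]  # this is actually the state after chunk 4
right = saved_states[stored_index(4, 2)]
```

In this toy setting, a backward pass that assumes `saved_states[k]` is the state after chunk `k` reads the wrong value whenever `save_every_nth > 1`, which is analogous to the assumption the fix removes.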

hoedt added 4 commits June 10, 2025 10:22

hot-fix for recurrent backward bug

e993763

use scaM_inter shape to infer chunk size

a9edba2

remove 'save_every_nth_chunk'-logic from kernel

29bb53d

add test to avoid future errors

027f051
