The prefill attention kernel performance has degraded significantly in
recent releases (since v0.1.2), especially on A100 when `causal=True`.
This is mainly because we added new attention variants (which increases
register usage and thus incurs register spilling) and moved some
parameters from compile-time to runtime.
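
As a rough illustration of the second point (plain C++ rather than the actual kernel code, and the function names are hypothetical): when a quantity like the GQA group size is a compile-time template argument, the compiler can fold the head-index division into cheap shifts or multiplies, whereas a runtime value forces a real integer division and extra live registers.

```cpp
#include <cstdint>

// Compile-time group size: the divisor is a constant, so the division can be
// lowered to a shift or multiply-by-reciprocal at compile time.
template <uint32_t kGroupSize>
uint32_t KvHeadStatic(uint32_t qo_head_idx) {
  return qo_head_idx / kGroupSize;
}

// Runtime group size: the divisor is only known at launch time, so the
// generated code performs a genuine integer division, which costs more
// instructions and registers inside a GPU kernel.
uint32_t KvHeadDynamic(uint32_t qo_head_idx, uint32_t group_size) {
  return qo_head_idx / group_size;
}
```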
This PR alleviates the issue by caching some of the variables related to
the GQA group size.
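
A minimal sketch of the idea (not the PR's actual implementation; `GroupSizeCache`, `MakeGroupSizeCache`, and `QoHeadToKvHead` are hypothetical names): compute the group-size-derived quantities once, then let the hot path map a query head to its KV head with a cached shift or division instead of re-deriving everything per head.

```cpp
#include <cstdint>

// Hypothetical cache of values derived from the GQA group size
// (group_size = num_qo_heads / num_kv_heads), computed once per launch.
struct GroupSizeCache {
  uint32_t group_size;
  uint32_t log2_group_size;  // valid when group_size is a power of two
  bool is_pow2;
};

inline GroupSizeCache MakeGroupSizeCache(uint32_t num_qo_heads,
                                         uint32_t num_kv_heads) {
  GroupSizeCache c{};
  c.group_size = num_qo_heads / num_kv_heads;
  c.is_pow2 = (c.group_size & (c.group_size - 1)) == 0;
  c.log2_group_size = 0;
  for (uint32_t g = c.group_size; g > 1; g >>= 1) ++c.log2_group_size;
  return c;
}

// Hot path: map a query head to its KV head using the cached values rather
// than recomputing the group size and repeating the division every time.
inline uint32_t QoHeadToKvHead(const GroupSizeCache& c, uint32_t qo_head_idx) {
  return c.is_pow2 ? (qo_head_idx >> c.log2_group_size)
                   : (qo_head_idx / c.group_size);
}
```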
In the next PR, we will support another mode, `kv_head_major`, in addition
to `qo_head_major`, to further accelerate GQA prefill with query sizes >=
64; a sketch of the two layouts follows below.
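
For readers unfamiliar with the two orders, here is an illustrative sketch (the index helpers and names are hypothetical, not the planned API): `qo_head_major` keeps all rows of one query head contiguous, while `kv_head_major` groups the rows of the query heads that share a KV head, so a tile of consecutive rows can reuse the same KV data.

```cpp
#include <cstdint>

// qo_head_major: rows are ordered head by head.
inline uint32_t QoHeadMajorIdx(uint32_t qo_head_idx, uint32_t pos,
                               uint32_t num_rows) {
  return qo_head_idx * num_rows + pos;
}

// kv_head_major: the group_size query heads that share a KV head are
// interleaved, so group_size consecutive rows read the same KV head.
inline uint32_t KvHeadMajorIdx(uint32_t qo_head_idx, uint32_t pos,
                               uint32_t num_rows, uint32_t group_size) {
  uint32_t kv_head_idx = qo_head_idx / group_size;
  uint32_t lane = qo_head_idx % group_size;
  return (kv_head_idx * num_rows + pos) * group_size + lane;
}
```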
cc @AKKamath