Tri Dao
|
29cdfedd80
Use Bulk reduce instead of TMA for dQaccum, split across WGs
|
1 month ago |
Tri Dao
|
c5ba47b3d5
Add fence.async to epilogue
|
1 month ago |
Tri Dao
|
9f82a326ad
Implement rotary for attn decode
|
1 month ago |
Tri Dao
|
4d00645c76
Implement appending new KV to KV cache
|
1 month ago |
Tri Dao
|
df96486c31
Decode: varlen, paged KV, leftpad
|
1 month ago |
Tri Dao
|
6e8b25e426
Refactor
|
3 months ago |
Tri Dao
|
bafe253042
[FA3] Bwd
|
5 months ago |
jayhshah
|
5018ac6ac5
Fp8 kernel with "in-kernel" transpose of V in producer (#1100)
|
5 months ago |
Tri Dao
|
74b0761ff7
[FA3] BF16 forward
|
5 months ago |