Tri Dao | 29cdfedd80 | Use Bulk reduce instead of TMA for dQaccum, split across WGs | 1 month ago
Tri Dao | c5ba47b3d5 | Add fence.async to epilogue | 1 month ago
Tri Dao | 9f82a326ad | Implement rotary for attn decode | 1 month ago
Tri Dao | 4d00645c76 | Implement appending new KV to KV cache | 1 month ago
Tri Dao | df96486c31 | Decode: varlen, paged KV, leftpad | 1 month ago
Tri Dao | 6e8b25e426 | Refactor | 3 months ago
Tri Dao | bafe253042 | [FA3] Bwd | 5 months ago
jayhshah | 5018ac6ac5 | Fp8 kernel with "in-kernel" transpose of V in producer (#1100) | 5 months ago
Tri Dao | 74b0761ff7 | [FA3] BF16 forward | 5 months ago