Tri Dao | fb9c9cbbe9 | Support qkv_descale of shape (batch_size, nheads_kv) | 1 month ago
Tri Dao | 94657af3e8 | Add option for not doing intra-WG overlapping of gemm and softmax | 1 month ago
Tri Dao | f0b5a6ec4c | Wait for barrier_O at load_tail to avoid Cluster error | 1 month ago
Tri Dao | fc2fd95a18 | Renable FP8 kernels | 1 month ago
Tri Dao | 5f8297808d | Don't need to write zero to output if Split | 1 month ago
Tri Dao | 3d0d147940 | Early stop on actual num_splits in mha_combine kernel | 1 month ago
Tri Dao | 1dc3364774 | Consolidate seqlen info into a struct | 2 months ago
Tri Dao | f6fd36f4b9 | Move the check n_block_max<=n_block_min from fwd_kernel to mainloop | 2 months ago
Tri Dao | c5ba47b3d5 | Add fence.async to epilogue | 2 months ago
Tri Dao | 9f82a326ad | Implement rotary for attn decode | 2 months ago
Tri Dao | 5e628dcb5b | Don't need membar.cta | 2 months ago
Tri Dao | 4d00645c76 | Implement appending new KV to KV cache | 2 months ago
Tri Dao | 2e4eabd082 | Move barrier_O arrive from mainloop to epilogue to simplify | 2 months ago
Tri Dao | f5bd27d778 | Move PackGQA load_Q to a separate file | 2 months ago
Tri Dao | df96486c31 | Decode: varlen, paged KV, leftpad | 2 months ago
Tri Dao | 6e8b25e426 | Refactor | 3 months ago
Ying Zhang | 7b4e68e04f | hopper local attention | 4 months ago
jayhshah | c92ca63268 | FA3 FP8 qkv descales + restore max offset for h128 causal + added sync for producer WG (#1173) | 4 months ago
Ying Zhang | a3a257c71d | Fix out-of-bound writes for var-seq-len zero-length KVs | 5 months ago
jayhshah | 5018ac6ac5 | Fp8 kernel with "in-kernel" transpose of V in producer (#1100) | 5 months ago
Ying Zhang | dfe1a59e4b | Add var-seq-len to FA3 fp16 / bf16 fwd (#1072) | 6 months ago
Tri Dao | 74b0761ff7 | [FA3] BF16 forward | 6 months ago
Tri Dao | 7f67966cc7 | FA3 initial code release | 6 months ago