Tri Dao | 0519920e23 | Deal with the case where q or k/v have length 0 | 2 weeks ago
Tri Dao | 39afd52bd2 | Actually fix window_size for bwd pass | 2 weeks ago
Tri Dao | a44cd67d3f | Move testing util functions to a separate file | 2 weeks ago
Tri Dao | a609d82315 | Change extension name to flash_attn_3_cuda | 2 weeks ago
Tri Dao | f907a13187 | Tune tile sizes for fwd varlen on Sm80 and Sm86 | 2 weeks ago
Tri Dao | 76f14c61c9 | Tune fwd tile sizes for Sm86 and Sm89 | 3 weeks ago
Tri Dao | 51484a7b56 | Make backward epilogue work for Sm80 | 3 weeks ago
Tri Dao | 69bd392159 | Merge bwd and bwd_varlen in the C++ API | 1 month ago
Tri Dao | c3cdc0fd88 | Add sm_margin as an option for overlapping with communication | 1 month ago
Tri Dao | 7f5d73a162 | Add env var to disable specific hdim | 1 month ago
Tri Dao | 234c557190 | Fix kvcache test in the case with cu_seqlens_k_new | 1 month ago
Tri Dao | ba2061dfe8 | Support cu_seqlens_k_new in flash_attn_with_kvcache | 1 month ago
Tri Dao | 6807b1ea37 | Longest-processing-time-first scheduler for causal | 1 month ago
Tri Dao | fb9c9cbbe9 | Support qkv_descale of shape (batch_size, nheads_kv) | 1 month ago
Tri Dao | 6293008748 | Add option for Mma0_is_RS and Mma1_is_RS in attn fwd | 1 month ago
Tri Dao | 88fdffc16e | Fix test for softcap FP8 | 1 month ago
Tri Dao | f5e89ff136 | Tune tile size for bwd softcap | 1 month ago
Tri Dao | 29cdfedd80 | Use Bulk reduce instead of TMA for dQaccum, split across WGs | 1 month ago
Tri Dao | 9c954f7021 | Use num_split_heuristics in fwd and fwd_varlen | 1 month ago
Tri Dao | 199c82052c | Fix test for has_batch_idx | 1 month ago
Tri Dao | 42fc4962f0 | Uncomment tanh softcapping | 2 months ago
Tri Dao | 3248babb9e | QOL: Use env var to selectively disable features | 2 months ago
Tri Dao | c9c40eba83 | Uncomment local attn | 2 months ago
Tri Dao | 94657af3e8 | Add option for not doing intra-WG overlapping of gemm and softmax | 2 months ago
Tri Dao | f0b5a6ec4c | Wait for barrier_O at load_tail to avoid Cluster error | 2 months ago
Tri Dao | fc2fd95a18 | Renable FP8 kernels | 2 months ago
Tri Dao | 3d0d147940 | Early stop on actual num_splits in mha_combine kernel | 2 months ago
Tri Dao | 1dc3364774 | Consolidate seqlen info into a struct | 2 months ago
Tri Dao | 0c49ac9a07 | Implement rotary non-interleaved | 2 months ago
Tri Dao | 9f82a326ad | Implement rotary for attn decode | 2 months ago