Tri Dao
|
6807b1ea37
Longest-processing-time-first scheduler for causal
|
1 day ago |
Tri Dao
|
fb9c9cbbe9
Support qkv_descale of shape (batch_size, nheads_kv)
|
3 days ago |
Tri Dao
|
6293008748
Add option for Mma0_is_RS and Mma1_is_RS in attn fwd
|
1 week ago |
Tri Dao
|
88fdffc16e
Fix test for softcap FP8
|
1 week ago |
Tri Dao
|
f5e89ff136
Tune tile size for bwd softcap
|
1 week ago |
Tri Dao
|
29cdfedd80
Use Bulk reduce instead of TMA for dQaccum, split across WGs
|
1 week ago |
Tri Dao
|
9c954f7021
Use num_split_heuristics in fwd and fwd_varlen
|
1 week ago |
Tri Dao
|
199c82052c
Fix test for has_batch_idx
|
2 weeks ago |
Tri Dao
|
42fc4962f0
Uncomment tanh softcapping
|
2 weeks ago |
Tri Dao
|
3248babb9e
QOL: Use env var to selectively disable features
|
2 weeks ago |
Tri Dao
|
c9c40eba83
Uncomment local attn
|
2 weeks ago |
Tri Dao
|
94657af3e8
Add option for not doing intra-WG overlapping of gemm and softmax
|
3 weeks ago |
Tri Dao
|
f0b5a6ec4c
Wait for barrier_O at load_tail to avoid Cluster error
|
3 weeks ago |
Tri Dao
|
fc2fd95a18
Renable FP8 kernels
|
3 weeks ago |
Tri Dao
|
3d0d147940
Early stop on actual num_splits in mha_combine kernel
|
3 weeks ago |
Tri Dao
|
1dc3364774
Consolidate seqlen info into a struct
|
3 weeks ago |
Tri Dao
|
0c49ac9a07
Implement rotary non-interleaved
|
3 weeks ago |
Tri Dao
|
9f82a326ad
Implement rotary for attn decode
|
4 weeks ago |
Tri Dao
|
4d00645c76
Implement appending new KV to KV cache
|
1 month ago |
Tri Dao
|
82c1aa3514
Move PackGQA epilogue code to pack_gqa.h
|
1 month ago |
Tri Dao
|
df96486c31
Decode: varlen, paged KV, leftpad
|
1 month ago |
Tri Dao
|
ea7a98f15d
Fix backward with softcap
|
2 months ago |
Tri Dao
|
6e8b25e426
Refactor
|
2 months ago |
Ying Zhang
|
be6c1b98c4
small fixes
|
3 months ago |
Ying Zhang
|
dff976a84a
fixes
|
3 months ago |
Ying Zhang
|
8cbc8a042f
small fixes
|
3 months ago |
Ying Zhang
|
db80387343
Add seqused_q in fwd / bwd and seqused_k in bwd.
|
3 months ago |
jayhshah
|
c92ca63268
FA3 FP8 qkv descales + restore max offset for h128 causal + added sync for producer WG (#1173)
|
3 months ago |
Ying Zhang
|
53537da422
add a unittest
|
4 months ago |
Tri Dao
|
c33de664a1
Fix import in test
|
4 months ago |