Tri Dao
|
fb9c9cbbe9
Support qkv_descale of shape (batch_size, nheads_kv)
|
2 days ago |
Tri Dao
|
6293008748
Add option for Mma0_is_RS and Mma1_is_RS in attn fwd
|
6 days ago |
Tri Dao
|
ae3c1fb3e0
Simplify bwd by setting NumdQWarpGroups = NumMmaWarpGroups
|
6 days ago |
Tri Dao
|
2c996ca25f
Use SeqlenInfo for bwd and epilogue
|
1 week ago |
Tri Dao
|
88fdffc16e
Fix test for softcap FP8
|
1 week ago |
Tri Dao
|
f5e89ff136
Tune tile size for bwd softcap
|
1 week ago |
Tri Dao
|
3b6ac2b954
Use compile time constants in local mask
|
1 week ago |
Tri Dao
|
bfbaafd043
Fix bwd reading out of out LSE
|
1 week ago |
Tri Dao
|
29cdfedd80
Use Bulk reduce instead of TMA for dQaccum, split across WGs
|
1 week ago |
Tri Dao
|
9c954f7021
Use num_split_heuristics in fwd and fwd_varlen
|
1 week ago |
Tri Dao
|
314b9edfc0
Don't need to link to cuda lib anymore
|
1 week ago |
Tri Dao
|
f6e165becf
Change tile_size and local to avoid wgmma being serialized
|
1 week ago |
Tri Dao
|
f11624b746
Disable --split-compile due to ptxas register allocation failure
|
1 week ago |
Tri Dao
|
e8a1edbeb2
Clean up some #include
|
1 week ago |
Tri Dao
|
8ae77ea17c
Download nvcc 12.3 to compile for best perf
|
1 week ago |
Tri Dao
|
199c82052c
Fix test for has_batch_idx
|
1 week ago |
Tri Dao
|
3ed79742fb
Add option to shuffle LSE and dPsum in the bwd
|
2 weeks ago |
Tri Dao
|
42fc4962f0
Uncomment tanh softcapping
|
2 weeks ago |
Tri Dao
|
6bc55b571c
Use --split-compile to speed up compilation
|
2 weeks ago |
Tri Dao
|
82dc825759
Don't use the unsafe convert_type function
|
2 weeks ago |
Tri Dao
|
9553b2728f
More env vars to disable features
|
2 weeks ago |
Tri Dao
|
3248babb9e
QOL: Use env var to selectively disable features
|
2 weeks ago |
Tri Dao
|
c9c40eba83
Uncomment local attn
|
2 weeks ago |
Tri Dao
|
94657af3e8
Add option for not doing intra-WG overlapping of gemm and softmax
|
2 weeks ago |
Tri Dao
|
a4d41d2605
Fix epilogue compilation
|
2 weeks ago |
Tri Dao
|
f0b5a6ec4c
Wait for barrier_O at load_tail to avoid Cluster error
|
2 weeks ago |
Tri Dao
|
95ba9e51e5
Simplify epilogue when split by using thread_mma.partition_C
|
3 weeks ago |
Tri Dao
|
47d4d2a76d
Fix FP8 hdim 256 perf regression
|
3 weeks ago |
Tri Dao
|
fc2fd95a18
Re-enable FP8 kernels
|
3 weeks ago |
Tri Dao
|
e7b93e3902
Clean up mha_combine kernel
|
3 weeks ago |