Tri Dao
|
6807b1ea37
Longest-processing-time-first scheduler for causal
|
14 hours ago |
Tri Dao
|
fb9c9cbbe9
Support qkv_descale of shape (batch_size, nheads_kv)
|
3 days ago |
Tri Dao
|
6293008748
Add option for Mma0_is_RS and Mma1_is_RS in attn fwd
|
1 week ago |
Tri Dao
|
9c954f7021
Use num_split_heuristics in fwd and fwd_varlen
|
1 week ago |
Tri Dao
|
f6e165becf
Change tile_size and local to avoid wgmma being serialized
|
1 week ago |
Tri Dao
|
42fc4962f0
Uncomment tanh softcapping
|
2 weeks ago |
Tri Dao
|
9553b2728f
More env vars to disable features
|
2 weeks ago |
Tri Dao
|
3248babb9e
QOL: Use env var to selectively disable features
|
2 weeks ago |
Tri Dao
|
c9c40eba83
Uncomment local attn
|
2 weeks ago |
Tri Dao
|
94657af3e8
Add option for not doing intra-WG overlapping of gemm and softmax
|
3 weeks ago |
Tri Dao
|
fc2fd95a18
Renable FP8 kernels
|
3 weeks ago |
Tri Dao
|
64d92bce53
Split PagedKV into separate .cu files to speed up compilation
|
3 weeks ago |
Tri Dao
|
bc8a001d8d
Load cos/sin by splitting the work among threads on the same row
|
3 weeks ago |
Tri Dao
|
1dc3364774
Consolidate seqlen info into a struct
|
3 weeks ago |
Tri Dao
|
586ba914bb
Move fwd tile size to a separate file
|
3 weeks ago |
Tri Dao
|
018b9af683
Move .cu files to instantiations, use generate_kernels.py
|
3 weeks ago |
Tri Dao
|
0c49ac9a07
Implement rotary non-interleaved
|
3 weeks ago |
Tri Dao
|
b2d3fe92ff
Move rotary to a separate file
|
4 weeks ago |
Tri Dao
|
9f82a326ad
Implement rotary for attn decode
|
4 weeks ago |
Tri Dao
|
4d00645c76
Implement appending new KV to KV cache
|
1 month ago |
Tri Dao
|
2e4eabd082
Move barrier_O arrive from mainloop to epilogue to simplify
|
1 month ago |
Tri Dao
|
df96486c31
Decode: varlen, paged KV, leftpad
|
1 month ago |
Tri Dao
|
6e8b25e426
Refactor
|
2 months ago |
Ying Zhang
|
1c9717d699
address comments
|
2 months ago |
Ying Zhang
|
dff976a84a
fixes
|
3 months ago |
Ying Zhang
|
7b4e68e04f
hopper local attention
|
3 months ago |
Ying Zhang
|
db80387343
Add seqused_q in fwd / bwd and seqused_k in bwd.
|
3 months ago |
jayhshah
|
c92ca63268
FA3 FP8 qkv descales + restore max offset for h128 causal + added sync for producer WG (#1173)
|
3 months ago |
Tri Dao
|
bafe253042
[FA3] Bwd
|
4 months ago |
jayhshah
|
5018ac6ac5
Fp8 kernel with "in-kernel" transpose of V in producer (#1100)
|
4 months ago |