Tri Dao
|
76f14c61c9
Tune fwd tile sizes for Sm86 and Sm89
|
3 weeks ago |
Tri Dao
|
c3cdc0fd88
Add sm_margin as an option for overlapping with communication
|
1 month ago |
Tri Dao
|
3e5d77a102
Group instantiations for different hdims together
|
1 month ago |
Tri Dao
|
fb9c9cbbe9
Support qkv_descale of shape (batch_size, nheads_kv)
|
1 month ago |
Tri Dao
|
9c954f7021
Use num_split_heuristics in fwd and fwd_varlen
|
1 month ago |
Tri Dao
|
9fd6b977bb
Precompute the pointers in mha_combine kernel
|
2 months ago |
Tri Dao
|
64d92bce53
Split PagedKV into separate .cu files to speed up compilation
|
2 months ago |
Tri Dao
|
018b9af683
Move .cu files to instantiations, use generate_kernels.py
|
2 months ago |
Tri Dao
|
9f82a326ad
Implement rotary for attn decode
|
2 months ago |
Tri Dao
|
4d00645c76
Implement appending new KV to KV cache
|
2 months ago |
Tri Dao
|
a65af55f4a
Move mask_fn and load_Q into separate functions
|
2 months ago |
Tri Dao
|
df96486c31
Decode: varlen, paged KV, leftpad
|
2 months ago |
Tri Dao
|
6e8b25e426
Refactor
|
3 months ago |
Ying Zhang
|
7b4e68e04f
hopper local attention
|
5 months ago |
Ying Zhang
|
db80387343
Add seqused_q in fwd / bwd and seqused_k in bwd.
|
5 months ago |
jayhshah
|
c92ca63268
FA3 FP8 qkv descales + restore max offset for h128 causal + added sync for producer WG (#1173)
|
5 months ago |
Tri Dao
|
bafe253042
[FA3] Bwd
|
6 months ago |
Ying Zhang
|
dfe1a59e4b
Add var-seq-len to FA3 fp16 / bf16 fwd (#1072)
|
6 months ago |
Cameron Shinn
|
cb516f855b
Remove torchlib dependency from cpp files (#1083)
|
6 months ago |
Tri Dao
|
74b0761ff7
[FA3] BF16 forward
|
6 months ago |
Tri Dao
|
7f67966cc7
FA3 initial code release
|
6 months ago |