Ying Zhang
|
cdbbe844b1
minor changes to unpad_input test util func
|
3 months ago |
Tri Dao
|
299563626f
Fix test with alibi and cache_leftpad
|
4 months ago |
Tri Dao
|
751c762c9c
Don't specialize for hdim 224 to speed up compilation
|
4 months ago |
Phil Wang
|
5f1ae4a34b
backwards for softcapping (#1033)
|
4 months ago |
Tri Dao
|
40e534a7f6
Implement cache_leftpad
|
5 months ago |
Tri Dao
|
d0787acc16
Relax dropout_fraction test
|
5 months ago |
Tri Dao
|
dca6d89da4
Don't support softcap and dropout at the same time
|
5 months ago |
Tri Dao
|
81e01efd4b
More typo fixes
|
5 months ago |
Tri Dao
|
3d41db3e2c
Only test backward if there's no softcapping
|
5 months ago |
Nicolas Patry
|
8f873cc6ac
Implement softcapping. (#1025)
|
5 months ago |
muoshuosha
|
6df7e0a02e
Fix the varlen deterministic test (#1023)
|
5 months ago |
cao lei
|
6a2a16e994
fix typo (#974)
|
5 months ago |
Grigory Sizov
|
f816dee63c
Support unpadded LSE layout (#970)
|
5 months ago |
Grigory Sizov
|
2a15840f09
Enable paged attention in varlen forward (#831)
|
9 months ago |
Tri Dao
|
2406f28805
Enable headdim 256 backward on consumer GPUs (Ampere, Ada)
|
9 months ago |
Tri Dao
|
54e80a3829
Implement page KV cache
|
10 months ago |
Tri Dao
|
10dad61277
apply_dropout now takes tensor of rowcol layout
|
11 months ago |
Tri Dao
|
a7b66ae25a
Simplify writing softmax to gmem
|
11 months ago |
Tri Dao
|
732654583c
Implement deterministic backward (thanks to Meituan)
|
11 months ago |
Tri Dao
|
5ab9b3667b
Clean up alibi, implement non-causal alibi
|
11 months ago |
Tri Dao
|
e279bf8ed9
[Gen] Accept cache_batch_idx to index into the KV cache
|
1 year ago |
Tri Dao
|
083e8f525f
Implement local attention
|
1 year ago |
Tri Dao
|
65c234ed90
Don't over-allocate dq_accum in case of varlen
|
1 year ago |
Tri Dao
|
2d8ea9a530
Swap seqlen_q and ngroups when seqlen_q=1 (h/t Daniel Haziza)
|
1 year ago |
Tri Dao
|
3250ff3d82
Swap seqlen_q, nheads for MQA when seqlen_q=1 for fwd (h/t Daniel H)
|
1 year ago |
Tri Dao
|
ccbb14f38e
Implement rotary embedding in flash_attn_with_kvcache
|
1 year ago |
Tri Dao
|
56b7fc6ee0
Simplify the implementation of KVcache attn by appending KV first
|
1 year ago |
Tri Dao
|
37c6e05406
Implement flash_attn_with_kvcache
|
1 year ago |
Tri Dao
|
0c04943fa2
Require CUDA 11.6+, clean up setup.py
|
1 year ago |
Tri Dao
|
b1fbbd8337
Implement splitKV attention
|
1 year ago |