| Author | Commit | Message | Date |
|--------|--------|---------|------|
| XiaobingZhang | 0dfb281743 | don't save inputs buffer of FlashAttenFunc to reduce memory usage for inference mode (#1383) | 2 days ago |
| Michael Melesse | b518517cb8 | [AMD] Triton Backend for ROCm (#1203) | 1 week ago |
| Antoni Viros | 83e41b3ca4 | Add custom ops for compatibility with PT Compile (#1139) | 2 months ago |
| youkaichao | ef3e358a25 | remove lambda (#1056) | 4 months ago |
| Tri Dao | 898dd4bbf2 | Pass seqused_k to _flash_attn_varlen_forward | 5 months ago |
| Tri Dao | 40e534a7f6 | Implement cache_leftpad | 5 months ago |
| Tri Dao | 81e01efd4b | More typo fixes | 5 months ago |
| Tri Dao | 72e27c6320 | Fix typo with softcapping | 5 months ago |
| Phil Wang | f4628b43ec | missing commas and backwards return arguments (#1032) | 5 months ago |
| Nicolas Patry | 8f873cc6ac | Implement softcapping. (#1025) | 5 months ago |
| Jianwei Dong | 4e8d60069f | Add the return_softmax_lse parameter to the flash_attn_with_kvcache function to allow returning the logsumexp of the attention scores. (#989) | 5 months ago |
| Grigory Sizov | f816dee63c | Support unpadded LSE layout (#970) | 5 months ago |
| Grigory Sizov | 2a15840f09 | Enable paged attention in varlen forward (#831) | 9 months ago |
| Tao He | 204c3c6d1b | Fixes an error in comment (#785) | 10 months ago |
| Tri Dao | 54e80a3829 | Implement page KV cache | 10 months ago |
| Tri Dao | a7b66ae25a | Simplify writing softmax to gmem | 11 months ago |
| Tri Dao | 732654583c | Implement deterministic backward (thanks to Meituan) | 11 months ago |
| Tri Dao | 5ab9b3667b | Clean up alibi, implement non-causal alibi | 11 months ago |
| Tri Dao | bc28eacc60 | Format flash_attn_interface.py | 1 year ago |
| Sanghun Cho | e4f726fc44 | Support alibi, by Sanghun Cho from Kakao Brain | 1 year ago |
| Tri Dao | d4a7c8ffbb | [CI] Only compile for CUDA 11.8 & 12.2, MAX_JOBS=2, add torch-nightly | 1 year ago |
| Jeremy Reizenstein | ce3e7280f8 | Allow varlen_fwd to take optional seqused_k (#647) | 1 year ago |
| Tri Dao | e279bf8ed9 | [Gen] Accept cache_batch_idx to index into the KV cache | 1 year ago |
| Tri Dao | 083e8f525f | Implement local attention | 1 year ago |
| Tri Dao | ccbb14f38e | Implement rotary embedding in flash_attn_with_kvcache | 1 year ago |
| Tri Dao | ee77b931b9 | Swap seqlen_q and nheads for MQA to speed it up (h/t Daniel Haziza) | 1 year ago |
| Tri Dao | fd20f16a4e | Support cache_seqlens being integer | 1 year ago |
| Tri Dao | 37c6e05406 | Implement flash_attn_with_kvcache | 1 year ago |
| Tri Dao | 9e5e8bc91e | Change causal mask to be aligned to bottom-right instead of top-left | 1 year ago |
| Tri Dao | d431f16751 | Import torch before flash_attn_2_cuda | 1 year ago |