Tri Dao | 9e5e8bc91e | Change causal mask to be aligned to bottom-right instead of top-left | 1 year ago
Tri Dao | 0e8c46ae08 | Run isort and black on test files | 1 year ago
Tri Dao | c65b5106ac | Fix Bwd NaN for varlen when seqlen_q >> seqlen_k and causal | 1 year ago
Tri Dao | 3524e13c11 | Update to Cutlass 3.1 | 1 year ago
Tri Dao | 1c41d2b0e5 | Fix race condition in bwd (overwriting sK) | 1 year ago
Tri Dao | a4f148b6ab | Fix masking of bwd when seqlen is not divisible by 128 | 1 year ago
Tri Dao | 4f285b3547 | FlashAttention-2 release | 1 year ago
Tri Dao | a8fec99a9a | Skip flash_attn_split test | 2 years ago
Tri Dao | 9d3116addf | Don't enforce bitwise consistency for dq in race condition test | 2 years ago
Tri Dao | 6998e0ecdb | Fix out-of-bound memory read | 2 years ago
Tri Dao | 7479757191 | Fix pipelining bug in Triton bwd with bias_type=matrix | 2 years ago
Tri Dao | 557781933d | Parallelize CUDA bwd along seqlen_k instead of seqlen_q | 2 years ago
Tri Dao | ff78ea4123 | Fix race condition in Triton bwd when there's bias | 2 years ago
Tri Dao | 86862cfd7b | Implement attention bias for Triton version | 2 years ago
Tri Dao | aacc10fbab | Fix race condition in Triton bwd for non-po2 headdims | 2 years ago
Tri Dao | 1fb12afdfb | Avoid memcpy in the Triton bwd | 2 years ago
Tri Dao | 9b0bc97872 | Fix race condition in Triton fwd | 2 years ago
Tri Dao | 4f81aff46e | Add debug_barrier for all headdims in Triton bwd | 2 years ago
Tri Dao | e78d509c64 | [WIP] Support all head dimensions up to 128 in the Triton bwd | 2 years ago
Tri Dao | 008951f1d9 | Support all head dimensions up to 128 in the Triton fwd | 2 years ago
Tri Dao | b910bf14c1 | Support arbitrary seqlens (both q & k) in Triton bwd | 2 years ago
Tri Dao | dc55469355 | Support arbitrary seqlen_k in Triton bwd | 2 years ago
Tri Dao | d11341fd1a | Fix Triton fwd to support seqlen not multiples of 128 | 2 years ago
Tri Dao | b0c0db81f6 | Implement FlashAttention in Triton | 2 years ago
Tri Dao | 46fd2a20b2 | Support all head dims that are multiples of 8, up to 128 | 2 years ago
Tri Dao | a5a8806d1a | Split bwd on the seqlen_q dimension | 2 years ago
Tri Dao | 1aa6d7d9b6 | Rework dropout to decouple forward and backward | 2 years ago
Tri Dao | 52fb4b729b | Fix #54: set device for multi-GPU case | 2 years ago
Tri Dao | 5badfb7848 | Implement attention kernel that splits the batch into two | 2 years ago
Tri Dao | 0c01568daf | Only run backward test for d=128 on A100 | 2 years ago