Ying Zhang | 496fdc4f6c | Add seqused_q in fwd / bwd and seqused_k in bwd. | 4 months ago
Cameron Shinn | 3cea2fb6ee | Add ArchTag to pre/postprocess bwd kernels (#1180) | 4 months ago
jayhshah | c92ca63268 | FA3 FP8 qkv descales + restore max offset for h128 causal + added sync for producer WG (#1173) | 4 months ago
Tri Dao | d79f9b41a8 | [CrossEntropy] Use online softmax to simplify implementation | 4 months ago
Jay Shah | 32792d37ec | add missing if condition for key_padding_mask in test_util.py | 4 months ago
Ying Zhang | 28e7f4ddbd | Merge pull request #1155 from ipiszy/fix | 4 months ago
Ying Zhang | 53537da422 | add a unittest | 4 months ago
Ying Zhang | a3a257c71d | Fix out-of-bound writes for var-seq-len zero-length KVs | 4 months ago
Tri Dao | bcd918f275 | [LayerNorm] Add option to write result to out and residual_out | 4 months ago
Tri Dao | bd82d6c6eb | Revert "[LayerNorm] Don't store x + residual if we don't need gradients" | 4 months ago
Tri Dao | 800401847e | [LayerNorm] Don't store x + residual if we don't need gradients | 4 months ago
Garrett Byrd | 16025d8cc9 | Clearer install instructions for CUDA and ROCm backends (#1147) | 4 months ago
Ying Zhang | 3669b25206 | bwd benchmark + small fixes (#1129) | 5 months ago
Tri Dao | 5d5bfbb619 | Remove contiguous checks | 5 months ago
SueJane | 3f1b4d38e7 | Fix: check the type of max_seqlen_k instead of checking max_seqlen twice (#1127) | 5 months ago
Tri Dao | 3f6ff1c1c5 | Remove struct : cute::aligned_struct to avoid error with gcc 12 | 5 months ago
Tri Dao | c33de664a1 | Fix import in test | 5 months ago
Tri Dao | bafe253042 | [FA3] Bwd | 5 months ago
Ying Zhang | abffb0f98c | Merge pull request #1115 from ipiszy/bench | 5 months ago
Ying Zhang | c7f20a2d31 | add cudnn benchmark for var-len | 5 months ago
jayhshah | 5018ac6ac5 | Fp8 kernel with "in-kernel" transpose of V in producer (#1100) | 5 months ago
Tri Dao | c4b9015d74 | Add benchmark_gemm.py | 5 months ago
Tri Dao | 418d677192 | Bump to v2.6.3 | 5 months ago
Tri Dao | 65205d350e | [CI] Compile for pytorch 2.4.0 | 5 months ago
Tri Dao | 3aae9c18c1 | Revert "Changes For FP8 (#1075)" | 5 months ago
ganeshcolfax | 1899c970c8 | Changes For FP8 (#1075) | 5 months ago
Tri Dao | 59594f2a67 | Bump to v2.6.2 | 5 months ago
Tri Dao | 299563626f | Fix test with alibi and cache_leftpad | 5 months ago
Tri Dao | 4488acee8d | [CI] Compile with torch 2.4.0.dev20240527 | 5 months ago
Tri Dao | 65f723bb9a | Split bwd into more .cu files to speed up compilation | 5 months ago