Jay Shah | c272619bf5 | disable clusters | 1 month ago
Jay Shah | bb169e2a54 | revert to single tile scheduler | 1 month ago
Jay Shah | 483b26e541 | add working fp8 varlen | 1 month ago
Son Nguyen | 478ee666cc | Make namespace comment consistent (#1305) | 1 month ago
milesvant | c1d146cbd5 | Fix copy-paste error in hopper tests (#1279) | 2 months ago
jayhshah | a5a75274bc | FA3 kvcache + split kv + gqa parallelization (#1236) | 2 months ago
Tri Dao | bedf877467 | [CrossEntropy] Fix where labels address not aligned to 16 bytes | 2 months ago
rocking | 53a4f34163 | Hotfix due to change of upstream api (#1239) | 2 months ago
hlky | 8476986721 | Fix FAv3 compilation with MSVC (#1240) | 2 months ago
Ying Zhang | 9cafd4ae14 | Merge pull request #1233 from Dao-AILab/ipiszy/local_attn | 2 months ago
Ying Zhang | 1c9717d699 | address comments | 2 months ago
Zhihao Shen | 30e1ef0f79 | minify torch.torch.int32 to torch.int32 (#1237) | 2 months ago
Antoni Viros | 83e41b3ca4 | Add custom ops for compatibility with PT Compile (#1139) | 2 months ago
Ying Zhang | be6c1b98c4 | small fixes | 3 months ago
Ying Zhang | dff976a84a | fixes | 3 months ago
Ying Zhang | 7b4e68e04f | hopper local attention | 3 months ago
Ying Zhang | af314d4006 | Merge pull request #1182 from ipiszy/used_q | 3 months ago
Ying Zhang | 8cbc8a042f | small fixes | 3 months ago
Ying Zhang | cdbbe844b1 | minor changes to unpad_input test util func | 3 months ago
Ying Zhang | db80387343 | Add seqused_q in fwd / bwd and seqused_k in bwd. | 3 months ago
rocking | e2182cc21d | Support page kvcache in AMD ROCm (#1198) | 3 months ago
Tri Dao | cc1690d9d6 | [Rotary] Add test for rotary when qkv are packed an there's GQA | 3 months ago
Tri Dao | 8c20cfef49 | [Rotary] Support qkv block layout from GQA | 3 months ago
Charlene Yang | bdf733be55 | Add q, k, v descales to FA3 interface (#1210) | 3 months ago
Tri Dao | c7f32a8409 | [CrossEntropy] Support precomputed LSE | 3 months ago
juejuezi | e371bea04f | feat: change minimal supported CUDA version to 11.7 (#1206) | 3 months ago
Cameron Shinn | 3cea2fb6ee | Add ArchTag to pre/postprocess bwd kernels (#1180) | 3 months ago
jayhshah | c92ca63268 | FA3 FP8 qkv descales + restore max offset for h128 causal + added sync for producer WG (#1173) | 3 months ago
Tri Dao | d79f9b41a8 | [CrossEntropy] Use online softmax to simplify implementation | 3 months ago
Jay Shah | 32792d37ec | add missing if condition for key_padding_mask in test_util.py | 3 months ago