Tri Dao
|
02541ac9e8
[CE] Assert logit_scale > 0
|
1 week ago |
Tri Dao
|
803f609aa1
Fix comment in assert
|
1 week ago |
Tri Dao
|
454ce31594
[FA3] Compile with nvcc 12.8 instead of 12.3
|
1 week ago |
Tri Dao
|
5231d95fe1
Drop Pytorch 2.1
|
1 week ago |
Tri Dao
|
979702c87a
Bump to v2.7.4
|
1 week ago |
Tri Dao
|
bb135af07c
Don't compile for CUDA 11, compile for official pytorch 2.6.0
|
1 week ago |
Aman Karmani
|
cd393e0ace
[Build] Update version of setuptools used to generate core package (#1460)
|
1 week ago |
Michał Górny
|
6b1d059eda
Support ROCM builds from source distribution, and improve error handling (#1446)
|
3 weeks ago |
Driss Guessous
|
bc482cbf91
Add a macro for namespace (#1419)
|
4 weeks ago |
Ying Zhang
|
0fcd405cfe
Merge pull request #1442 from houseroad/replace_c10_optional
|
4 weeks ago |
Lu Fang
|
74aed78373
Replace c10::optional with std::optional in flash_attn
|
4 weeks ago |
Tri Dao
|
22c0358f4b
Fix nvcc_from_env not found
|
1 month ago |
Kirthi Shankar Sivamani
|
89c5a7dd4e
Change version to 2.7.3 (#1437)
|
1 month ago |
Tri Dao
|
2ac6c986be
Fix Sm80 tile_count_semaphore, adjust test tolerance
|
1 month ago |
Kirthi Shankar Sivamani
|
07bddf918a
Blackwell support (#1436)
|
1 month ago |
rocking
|
77ad12d24e
[AMD ROCm] Support variable length of page attention (#1431)
|
1 month ago |
Kirthi Shankar Sivamani
|
d57f826835
Expose `zero_tensors` arg in varlen functions (#1433)
|
1 month ago |
Tri Dao
|
a93359a2bf
If PackGQA, use producer threads instead of Mma threads to load Q
|
1 month ago |
Tri Dao
|
e94f7e89dc
Always enable PackGQA if Split to reduce compilation and binary size
|
1 month ago |
Tri Dao
|
40fa35acd8
Always enable PackGQA if PagedKV to reduce compilation and bin size
|
1 month ago |
Tri Dao
|
a84a237d2a
Split bwd softcap compilation units for Sm80
|
1 month ago |
Tri Dao
|
518e919a60
Fix softcap compilation
|
1 month ago |
Tri Dao
|
ea8cd7fe7b
Ungroup hdim and group softcap for Sm80 compilation
|
1 month ago |
Tri Dao
|
1e3208566a
Tune tile sizes for compilation
|
1 month ago |
Tri Dao
|
84f1287e42
Rename bool_constant<true> to true_type, same w bool_constant<false>
|
1 month ago |
Tri Dao
|
df5fe55264
Change tile sizes for Sm8x to reduce stack frame
|
1 month ago |
Tri Dao
|
8dd0b479d5
Always enable PackGQA for Sm8x to reduce compilation and binary size
|
1 month ago |
Garrett Byrd
|
a69fc21595
Added hipBLAS/cuBLAS distinction in benchmark_gemm.py (#1393)
|
1 month ago |
Kevin Wang
|
efbf19cd15
Fix incorrect torch dtype (#1399)
|
1 month ago |
Cao Dong
|
d525b38291
fix bug when is_grad is false (#1406)
|
1 month ago |