Tri Dao
|
4f0640d534
Move writing P to smem as separate function
|
1 day ago |
Tri Dao
|
3edf7e0daa
Add kwargs to _write_ninja_file for compatibility with new torch
|
1 day ago |
Tri Dao
|
45c48afb2b
Add option for WG1 to use RS MMA but WG2 using SS MMA
|
2 days ago |
xin-w8023
|
6865e60145
fix: prompt index to type longlong to avoid numerical overflow (#1500)
|
4 days ago |
Tri Dao
|
5458c78e6d
Remove sink token
|
4 days ago |
Tri Dao
|
20b84d6363
Don't use IntraWGOverlap for hdim 64,512
|
4 days ago |
Lucas Wilkinson
|
39e7197564
Fix cuda 12.1 build (#1511)
|
5 days ago |
Tri Dao
|
085ce5864a
Change margin in prepare_scheduler.cu from 20% to 10%
|
5 days ago |
Tri Dao
|
08f4c802c4
Add FLOPS to MLA decode benchmark
|
5 days ago |
Jiang, Zhiwei
|
dec83a10c4
fix: add "typename" prior to dependent type name (#1517)
|
5 days ago |
Tri Dao
|
3b5047d2ce
Fix loop in prepare_scheduler.cu (h/t Jay Shah)
|
1 week ago |
Tri Dao
|
9505c7436e
Adjust seqlen_q in MLA decode benchmark script
|
1 week ago |
Tri Dao
|
cdda5bfdd7
Update to Cutlass 3.8.0 tag
|
1 week ago |
Tri Dao
|
6752d62aa4
Add dynamic splits
|
1 week ago |
Tri Dao
|
6aed835dd9
Add simple script to benchmark MLA decode
|
1 week ago |
Ted Zadouri
|
06e34f62d1
Enable MLA flag in FA3 (rope=64, latent=512) (#1504)
|
1 week ago |
Tri Dao
|
ecdb528dea
Make rotary test optional in FA3
|
1 week ago |
Tri Dao
|
b36ad4ef76
Use split for super long sequences that don't fit into L2
|
2 weeks ago |
Tri Dao
|
74dfa43c8d
Fix divide by 0 in causal tile_scheduler for large seqlen
|
2 weeks ago |
Tri Dao
|
ea3ecea97a
Add tp_degree to benchmark_split_kv
|
2 weeks ago |
Tri Dao
|
91917b406b
Update benchmark_split_kv.py to work w new API
|
2 weeks ago |
Tri Dao
|
40cbd529e4
Temporarily change package name of FA3 to allow FA2 & FA3 install
|
2 weeks ago |
Anton Vlasjuk
|
a09abcd32d
make seqused optional on top level interface (#1497)
|
2 weeks ago |
Tri Dao
|
fa445ff6c2
Fix FP8 test
|
3 weeks ago |
Tri Dao
|
eafd53c2f1
Update cutlass 3.8 to fix error w cudaGetDriverEntryPointByVersion
|
3 weeks ago |
Tri Dao
|
9f313c7073
Move functions getting number of m/n blocks to a separate file
|
3 weeks ago |
Tri Dao
|
15cf7ee435
Rename collective_mainloop -> mainloop, move tile_scheduler variable
|
3 weeks ago |
Tri Dao
|
1a7f4dfa9e
Adjust ninja build file
|
3 weeks ago |
Tri Dao
|
5e39b100b4
Adjust tile size for hdim 64
|
3 weeks ago |
Tri Dao
|
c091545720
Update Cutlass to 3.8
|
3 weeks ago |