Lu Fang 74aed78373 Replace c10::optional with std::optional in flash_attn 2 weeks ago
..
instantiations e94f7e89dc Always enable PackGQA is Split to reduce compilation and binary size 3 weeks ago
__init__.py 7f67966cc7 FA3 initial code release 6 months ago
benchmark_attn.py 5f525322ec Only pass sm_90a compile flag to Sm90 kernels, same w Sm89 kernels 3 weeks ago
benchmark_flash_attention_fp8.py efbf19cd15 Fix incorrect torch dtype (#1399) 3 weeks ago
benchmark_split_kv.py a5a75274bc FA3 kvcache + split kv + gqa parallelization (#1236) 3 months ago
combine.h 478ee666cc Make namespace comment consistent (#1305) 3 months ago
copy_sm90_bulk_reduce.hpp 7a802796e1 Big refactor and update 3 weeks ago
epilogue_bwd.hpp 7bc3f031a4 Compile for both Sm80 and Sm90 3 weeks ago
epilogue_fwd.hpp 7bc3f031a4 Compile for both Sm80 and Sm90 3 weeks ago
flash.h a84a237d2a Split bwd softcap compilation units for Sm80 3 weeks ago
flash_api.cpp 74aed78373 Replace c10::optional with std::optional in flash_attn 2 weeks ago
flash_attn_interface.py 7a802796e1 Big refactor and update 3 weeks ago
flash_bwd_kernel_sm80.h 7a802796e1 Big refactor and update 3 weeks ago
flash_bwd_kernel_sm90.h 7a802796e1 Big refactor and update 3 weeks ago
flash_bwd_launch_template.h a84a237d2a Split bwd softcap compilation units for Sm80 3 weeks ago
flash_bwd_postprocess_kernel.h 7a802796e1 Big refactor and update 3 weeks ago
flash_bwd_preprocess_kernel.h 7a802796e1 Big refactor and update 3 weeks ago
flash_fwd_combine.cu 5f525322ec Only pass sm_90a compile flag to Sm90 kernels, same w Sm89 kernels 3 weeks ago
flash_fwd_combine_kernel.h 7a802796e1 Big refactor and update 3 weeks ago
flash_fwd_combine_launch_template.h 7a802796e1 Big refactor and update 3 weeks ago
flash_fwd_kernel_sm80.h 7a802796e1 Big refactor and update 3 weeks ago
flash_fwd_kernel_sm90.h a93359a2bf If PackGQA, use producer threads instead of Mma threads to load Q 3 weeks ago
flash_fwd_launch_template.h 1e3208566a Tune tile sizes for compilation 3 weeks ago
generate_kernels.py e94f7e89dc Always enable PackGQA is Split to reduce compilation and binary size 3 weeks ago
heuristics.h 7a802796e1 Big refactor and update 3 weeks ago
mainloop_bwd_sm80.hpp 7a802796e1 Big refactor and update 3 weeks ago
mainloop_bwd_sm90_tma_gmma_ws.hpp 7bc3f031a4 Compile for both Sm80 and Sm90 3 weeks ago
mainloop_fwd_sm80.hpp 84f1287e42 Rename bool_constant<true> to true_type, same w bool_constant<false> 3 weeks ago
mainloop_fwd_sm90_tma_gmma_ws.hpp a93359a2bf If PackGQA, use producer threads instead of Mma threads to load Q 3 weeks ago
mask.h 7a802796e1 Big refactor and update 3 weeks ago
named_barrier.hpp 7bc3f031a4 Compile for both Sm80 and Sm90 3 weeks ago
pack_gqa.h 7a802796e1 Big refactor and update 3 weeks ago
padding.py 7a802796e1 Big refactor and update 3 weeks ago
paged_kv.h 7a802796e1 Big refactor and update 3 weeks ago
rotary.h 7a802796e1 Big refactor and update 3 weeks ago
seqlen.h 7a802796e1 Big refactor and update 3 weeks ago
setup.py 22c0358f4b Fix nvcc_from_env not found 2 weeks ago
sm90_pipeline_no_cluster.hpp 68bf390920 Update Cutlass to fix mem fence 3 weeks ago
softmax.h 7a802796e1 Big refactor and update 3 weeks ago
static_switch.h 180ff782dd Template for Sm86 3 weeks ago
test_attn_kvcache.py a5a75274bc FA3 kvcache + split kv + gqa parallelization (#1236) 3 months ago
test_flash_attn.py 2ac6c986be Fix Sm80 tile_count_semaphore, adjust test tolerance 2 weeks ago
test_kvcache.py a5a75274bc FA3 kvcache + split kv + gqa parallelization (#1236) 3 months ago
test_util.py 7a802796e1 Big refactor and update 3 weeks ago
tile_scheduler.hpp 7bc3f031a4 Compile for both Sm80 and Sm90 3 weeks ago
tile_size.h 1e3208566a Tune tile sizes for compilation 3 weeks ago
utils.h 7bc3f031a4 Compile for both Sm80 and Sm90 3 weeks ago