.. |
instantiations
|
fc2fd95a18
Renable FP8 kernels
|
3 weeks ago |
__init__.py
|
7f67966cc7
FA3 initial code release
|
5 months ago |
benchmark_attn.py
|
82c1aa3514
Move PackGQA epilogue code to pack_gqa.h
|
1 month ago |
benchmark_flash_attention_fp8.py
|
c92ca63268
FA3 FP8 qkv descales + restore max offset for h128 causal + added sync for producer WG (#1173)
|
3 months ago |
copy_sm90_bulk_reduce.hpp
|
29cdfedd80
Use Bulk reduce instead of TMA for dQaccum, split across WGs
|
1 week ago |
epilogue_bwd_sm90_tma.hpp
|
2c996ca25f
Use SeqlenInfo for bwd and epilogue
|
1 week ago |
epilogue_fwd_sm90_tma.hpp
|
2c996ca25f
Use SeqlenInfo for bwd and epilogue
|
1 week ago |
flash.h
|
fb9c9cbbe9
Support qkv_descale of shape (batch_size, nheads_kv)
|
2 days ago |
flash_api.cpp
|
6807b1ea37
Longest-processing-time-first scheduler for causal
|
8 hours ago |
flash_attn_interface.py
|
0c49ac9a07
Implement rotary non-interleaved
|
3 weeks ago |
flash_bwd_kernel.h
|
29cdfedd80
Use Bulk reduce instead of TMA for dQaccum, split across WGs
|
1 week ago |
flash_bwd_launch_template.h
|
6807b1ea37
Longest-processing-time-first scheduler for causal
|
8 hours ago |
flash_bwd_postprocess_kernel.h
|
2c996ca25f
Use SeqlenInfo for bwd and epilogue
|
1 week ago |
flash_bwd_preprocess_kernel.h
|
2c996ca25f
Use SeqlenInfo for bwd and epilogue
|
1 week ago |
flash_fwd_combine_kernel.h
|
2c996ca25f
Use SeqlenInfo for bwd and epilogue
|
1 week ago |
flash_fwd_combine_launch_template.h
|
9fd6b977bb
Precompute the pointers in mha_combine kernel
|
3 weeks ago |
flash_fwd_combine_sm80.cu
|
9fd6b977bb
Precompute the pointers in mha_combine kernel
|
3 weeks ago |
flash_fwd_kernel.h
|
fb9c9cbbe9
Support qkv_descale of shape (batch_size, nheads_kv)
|
2 days ago |
flash_fwd_launch_template.h
|
6807b1ea37
Longest-processing-time-first scheduler for causal
|
8 hours ago |
generate_kernels.py
|
fc2fd95a18
Renable FP8 kernels
|
3 weeks ago |
heuristics.h
|
6807b1ea37
Longest-processing-time-first scheduler for causal
|
8 hours ago |
mainloop_bwd_sm90_tma_gmma_ws.hpp
|
ae3c1fb3e0
Simplify bwd by setting NumdQWarpGroups = NumMmaWarpGroups
|
1 week ago |
mainloop_fwd_sm90_tma_gmma_ws.hpp
|
fb9c9cbbe9
Support qkv_descale of shape (batch_size, nheads_kv)
|
2 days ago |
mask.h
|
3b6ac2b954
Use compile time constants in local mask
|
1 week ago |
named_barrier.hpp
|
29cdfedd80
Use Bulk reduce instead of TMA for dQaccum, split across WGs
|
1 week ago |
pack_gqa.h
|
fe412d6b36
Redo rotary when contiguous
|
3 weeks ago |
paged_kv.h
|
94657af3e8
Add option for not doing intra-WG overlapping of gemm and softmax
|
3 weeks ago |
rotary.h
|
82dc825759
Don't use the unsafe convert_type function
|
2 weeks ago |
seqlen.h
|
2c996ca25f
Use SeqlenInfo for bwd and epilogue
|
1 week ago |
setup.py
|
6807b1ea37
Longest-processing-time-first scheduler for causal
|
8 hours ago |
softmax.h
|
6807b1ea37
Longest-processing-time-first scheduler for causal
|
8 hours ago |
static_switch.h
|
42fc4962f0
Uncomment tanh softcapping
|
2 weeks ago |
test_flash_attn.py
|
6807b1ea37
Longest-processing-time-first scheduler for causal
|
8 hours ago |
tile_scheduler.hpp
|
6807b1ea37
Longest-processing-time-first scheduler for causal
|
8 hours ago |
tile_size.h
|
6807b1ea37
Longest-processing-time-first scheduler for causal
|
8 hours ago |
utils.h
|
e8a1edbeb2
Clean up some #include
|
1 week ago |