.. |
instantiations
|
fc2fd95a18
Renable FP8 kernels
|
3 weeks ago |
__init__.py
|
7f67966cc7
FA3 initial code release
|
5 months ago |
benchmark_attn.py
|
82c1aa3514
Move PackGQA epilogue code to pack_gqa.h
|
1 month ago |
benchmark_flash_attention_fp8.py
|
c92ca63268
FA3 FP8 qkv descales + restore max offset for h128 causal + added sync for producer WG (#1173)
|
3 months ago |
copy_sm90_bulk_reduce.hpp
|
29cdfedd80
Use Bulk reduce instead of TMA for dQaccum, split across WGs
|
1 week ago |
epilogue_bwd_sm90_tma.hpp
|
2c996ca25f
Use SeqlenInfo for bwd and epilogue
|
1 week ago |
epilogue_fwd_sm90_tma.hpp
|
2c996ca25f
Use SeqlenInfo for bwd and epilogue
|
1 week ago |
flash.h
|
fb9c9cbbe9
Support qkv_descale of shape (batch_size, nheads_kv)
|
2 days ago |
flash_api.cpp
|
fb9c9cbbe9
Support qkv_descale of shape (batch_size, nheads_kv)
|
2 days ago |
flash_attn_interface.py
|
0c49ac9a07
Implement rotary non-interleaved
|
3 weeks ago |
flash_bwd_kernel.h
|
29cdfedd80
Use Bulk reduce instead of TMA for dQaccum, split across WGs
|
1 week ago |
flash_bwd_launch_template.h
|
ae3c1fb3e0
Simplify bwd by setting NumdQWarpGroups = NumMmaWarpGroups
|
6 days ago |
flash_bwd_postprocess_kernel.h
|
2c996ca25f
Use SeqlenInfo for bwd and epilogue
|
1 week ago |
flash_bwd_preprocess_kernel.h
|
2c996ca25f
Use SeqlenInfo for bwd and epilogue
|
1 week ago |
flash_fwd_combine_kernel.h
|
2c996ca25f
Use SeqlenInfo for bwd and epilogue
|
1 week ago |
flash_fwd_combine_launch_template.h
|
9fd6b977bb
Precompute the pointers in mha_combine kernel
|
3 weeks ago |
flash_fwd_combine_sm80.cu
|
9fd6b977bb
Precompute the pointers in mha_combine kernel
|
3 weeks ago |
flash_fwd_kernel.h
|
fb9c9cbbe9
Support qkv_descale of shape (batch_size, nheads_kv)
|
2 days ago |
flash_fwd_launch_template.h
|
fb9c9cbbe9
Support qkv_descale of shape (batch_size, nheads_kv)
|
2 days ago |
generate_kernels.py
|
fc2fd95a18
Renable FP8 kernels
|
3 weeks ago |
mainloop_bwd_sm90_tma_gmma_ws.hpp
|
ae3c1fb3e0
Simplify bwd by setting NumdQWarpGroups = NumMmaWarpGroups
|
6 days ago |
mainloop_fwd_sm90_tma_gmma_ws.hpp
|
fb9c9cbbe9
Support qkv_descale of shape (batch_size, nheads_kv)
|
2 days ago |
mask.h
|
3b6ac2b954
Use compile time constants in local mask
|
1 week ago |
named_barrier.hpp
|
29cdfedd80
Use Bulk reduce instead of TMA for dQaccum, split across WGs
|
1 week ago |
pack_gqa.h
|
fe412d6b36
Redo rotary when contiguous
|
3 weeks ago |
paged_kv.h
|
94657af3e8
Add option for not doing intra-WG overlapping of gemm and softmax
|
2 weeks ago |
rotary.h
|
82dc825759
Don't use the unsafe convert_type function
|
2 weeks ago |
seqlen.h
|
2c996ca25f
Use SeqlenInfo for bwd and epilogue
|
1 week ago |
setup.py
|
29cdfedd80
Use Bulk reduce instead of TMA for dQaccum, split across WGs
|
1 week ago |
softmax.h
|
6293008748
Add option for Mma0_is_RS and Mma1_is_RS in attn fwd
|
6 days ago |
static_switch.h
|
42fc4962f0
Uncomment tanh softcapping
|
2 weeks ago |
test_flash_attn.py
|
fb9c9cbbe9
Support qkv_descale of shape (batch_size, nheads_kv)
|
2 days ago |
tile_scheduler.hpp
|
df96486c31
Decode: varlen, paged KV, leftpad
|
1 month ago |
tile_size.h
|
6293008748
Add option for Mma0_is_RS and Mma1_is_RS in attn fwd
|
6 days ago |
utils.h
|
e8a1edbeb2
Clean up some #include
|
1 week ago |