Tri Dao 6807b1ea37 Longest-processing-time-first scheduler for causal 21 hours ago
..
instantiations fc2fd95a18 Renable FP8 kernels 3 weeks ago
__init__.py 7f67966cc7 FA3 initial code release 5 months ago
benchmark_attn.py 82c1aa3514 Move PackGQA epilogue code to pack_gqa.h 1 month ago
benchmark_flash_attention_fp8.py c92ca63268 FA3 FP8 qkv descales + restore max offset for h128 causal + added sync for producer WG (#1173) 3 months ago
copy_sm90_bulk_reduce.hpp 29cdfedd80 Use Bulk reduce instead of TMA for dQaccum, split across WGs 1 week ago
epilogue_bwd_sm90_tma.hpp 2c996ca25f Use SeqlenInfo for bwd and epilogue 1 week ago
epilogue_fwd_sm90_tma.hpp 2c996ca25f Use SeqlenInfo for bwd and epilogue 1 week ago
flash.h fb9c9cbbe9 Support qkv_descale of shape (batch_size, nheads_kv) 3 days ago
flash_api.cpp 6807b1ea37 Longest-processing-time-first scheduler for causal 21 hours ago
flash_attn_interface.py 0c49ac9a07 Implement rotary non-interleaved 3 weeks ago
flash_bwd_kernel.h 29cdfedd80 Use Bulk reduce instead of TMA for dQaccum, split across WGs 1 week ago
flash_bwd_launch_template.h 6807b1ea37 Longest-processing-time-first scheduler for causal 21 hours ago
flash_bwd_postprocess_kernel.h 2c996ca25f Use SeqlenInfo for bwd and epilogue 1 week ago
flash_bwd_preprocess_kernel.h 2c996ca25f Use SeqlenInfo for bwd and epilogue 1 week ago
flash_fwd_combine_kernel.h 2c996ca25f Use SeqlenInfo for bwd and epilogue 1 week ago
flash_fwd_combine_launch_template.h 9fd6b977bb Precompute the pointers in mha_combine kernel 3 weeks ago
flash_fwd_combine_sm80.cu 9fd6b977bb Precompute the pointers in mha_combine kernel 3 weeks ago
flash_fwd_kernel.h fb9c9cbbe9 Support qkv_descale of shape (batch_size, nheads_kv) 3 days ago
flash_fwd_launch_template.h 6807b1ea37 Longest-processing-time-first scheduler for causal 21 hours ago
generate_kernels.py fc2fd95a18 Renable FP8 kernels 3 weeks ago
heuristics.h 6807b1ea37 Longest-processing-time-first scheduler for causal 21 hours ago
mainloop_bwd_sm90_tma_gmma_ws.hpp ae3c1fb3e0 Simplify bwd by setting NumdQWarpGroups = NumMmaWarpGroups 1 week ago
mainloop_fwd_sm90_tma_gmma_ws.hpp fb9c9cbbe9 Support qkv_descale of shape (batch_size, nheads_kv) 3 days ago
mask.h 3b6ac2b954 Use compile time constants in local mask 1 week ago
named_barrier.hpp 29cdfedd80 Use Bulk reduce instead of TMA for dQaccum, split across WGs 1 week ago
pack_gqa.h fe412d6b36 Redo rotary when contiguous 3 weeks ago
paged_kv.h 94657af3e8 Add option for not doing intra-WG overlapping of gemm and softmax 3 weeks ago
rotary.h 82dc825759 Don't use the unsafe convert_type function 2 weeks ago
seqlen.h 2c996ca25f Use SeqlenInfo for bwd and epilogue 1 week ago
setup.py 6807b1ea37 Longest-processing-time-first scheduler for causal 21 hours ago
softmax.h 6807b1ea37 Longest-processing-time-first scheduler for causal 21 hours ago
static_switch.h 42fc4962f0 Uncomment tanh softcapping 2 weeks ago
test_flash_attn.py 6807b1ea37 Longest-processing-time-first scheduler for causal 21 hours ago
tile_scheduler.hpp 6807b1ea37 Longest-processing-time-first scheduler for causal 21 hours ago
tile_size.h 6807b1ea37 Longest-processing-time-first scheduler for causal 21 hours ago
utils.h e8a1edbeb2 Clean up some #include 2 weeks ago