david/flash-attention

Author	SHA1	Message	Date
Tri Dao	fb9c9cbbe9	Support qkv_descale of shape (batch_size, nheads_kv)	1 month ago
Tri Dao	94657af3e8	Add option for not doing intra-WG overlapping of gemm and softmax	1 month ago
Tri Dao	f0b5a6ec4c	Wait for barrier_O at load_tail to avoid Cluster error	1 month ago
Tri Dao	fc2fd95a18	Renable FP8 kernels	1 month ago
Tri Dao	5f8297808d	Don't need to write zero to output if Split	1 month ago
Tri Dao	3d0d147940	Early stop on actual num_splits in mha_combine kernel	1 month ago
Tri Dao	1dc3364774	Consolidate seqlen info into a struct	2 months ago
Tri Dao	f6fd36f4b9	Move the check n_block_max<=n_block_min from fwd_kernel to mainloop	2 months ago
Tri Dao	c5ba47b3d5	Add fence.async to epilogue	2 months ago
Tri Dao	9f82a326ad	Implement rotary for attn decode	2 months ago
Tri Dao	5e628dcb5b	Don't need membar.cta	2 months ago
Tri Dao	4d00645c76	Implement appending new KV to KV cache	2 months ago
Tri Dao	2e4eabd082	Move barrier_O arrive from mainloop to epilogue to simplify	2 months ago
Tri Dao	f5bd27d778	Move PackGQA load_Q to a separate file	2 months ago
Tri Dao	df96486c31	Decode: varlen, paged KV, leftpad	2 months ago
Tri Dao	6e8b25e426	Refactor	3 months ago
Ying Zhang	7b4e68e04f	hopper local attention	4 months ago
jayhshah	c92ca63268	FA3 FP8 qkv descales + restore max offset for h128 causal + added sync for producer WG (#1173)	4 months ago
Ying Zhang	a3a257c71d	Fix out-of-bound writes for var-seq-len zero-length KVs	5 months ago
jayhshah	5018ac6ac5	Fp8 kernel with "in-kernel" transpose of V in producer (#1100)	5 months ago
Ying Zhang	dfe1a59e4b	Add var-seq-len to FA3 fp16 / bf16 fwd (#1072)	6 months ago
Tri Dao	74b0761ff7	[FA3] BF16 forward	6 months ago
Tri Dao	7f67966cc7	FA3 initial code release	6 months ago
