Commit History

| Author | SHA1 | Message | Date |
|---|---|---|---|
| Tri Dao | fb9c9cbbe9 | Support qkv_descale of shape (batch_size, nheads_kv) | 1 month ago |
| Tri Dao | 94657af3e8 | Add option for not doing intra-WG overlapping of gemm and softmax | 1 month ago |
| Tri Dao | f0b5a6ec4c | Wait for barrier_O at load_tail to avoid Cluster error | 1 month ago |
| Tri Dao | fc2fd95a18 | Renable FP8 kernels | 1 month ago |
| Tri Dao | 5f8297808d | Don't need to write zero to output if Split | 1 month ago |
| Tri Dao | 3d0d147940 | Early stop on actual num_splits in mha_combine kernel | 1 month ago |
| Tri Dao | 1dc3364774 | Consolidate seqlen info into a struct | 2 months ago |
| Tri Dao | f6fd36f4b9 | Move the check n_block_max<=n_block_min from fwd_kernel to mainloop | 2 months ago |
| Tri Dao | c5ba47b3d5 | Add fence.async to epilogue | 2 months ago |
| Tri Dao | 9f82a326ad | Implement rotary for attn decode | 2 months ago |
| Tri Dao | 5e628dcb5b | Don't need membar.cta | 2 months ago |
| Tri Dao | 4d00645c76 | Implement appending new KV to KV cache | 2 months ago |
| Tri Dao | 2e4eabd082 | Move barrier_O arrive from mainloop to epilogue to simplify | 2 months ago |
| Tri Dao | f5bd27d778 | Move PackGQA load_Q to a separate file | 2 months ago |
| Tri Dao | df96486c31 | Decode: varlen, paged KV, leftpad | 2 months ago |
| Tri Dao | 6e8b25e426 | Refactor | 3 months ago |
| Ying Zhang | 7b4e68e04f | hopper local attention | 4 months ago |
| jayhshah | c92ca63268 | FA3 FP8 qkv descales + restore max offset for h128 causal + added sync for producer WG (#1173) | 4 months ago |
| Ying Zhang | a3a257c71d | Fix out-of-bound writes for var-seq-len zero-length KVs | 5 months ago |
| jayhshah | 5018ac6ac5 | Fp8 kernel with "in-kernel" transpose of V in producer (#1100) | 5 months ago |
| Ying Zhang | dfe1a59e4b | Add var-seq-len to FA3 fp16 / bf16 fwd (#1072) | 6 months ago |
| Tri Dao | 74b0761ff7 | [FA3] BF16 forward | 6 months ago |
| Tri Dao | 7f67966cc7 | FA3 initial code release | 6 months ago |