Commit History

| Author | SHA1 | Message | Date |
|---|---|---|---|
| Tri Dao | fc2fd95a18 | Renable FP8 kernels | 1 month ago |
| Tri Dao | e7b93e3902 | Clean up mha_combine kernel | 1 month ago |
| Tri Dao | 5f8297808d | Don't need to write zero to output if Split | 1 month ago |
| Tri Dao | 3d0d147940 | Early stop on actual num_splits in mha_combine kernel | 1 month ago |
| Tri Dao | 9fd6b977bb | Precompute the pointers in mha_combine kernel | 1 month ago |
| Tri Dao | 64d92bce53 | Split PagedKV into separate .cu files to speed up compilation | 1 month ago |
| Tri Dao | fe412d6b36 | Redo rotary when contiguous | 1 month ago |
| Tri Dao | bc8a001d8d | Load cos/sin by splitting the work among threads on the same row | 1 month ago |
| Tri Dao | 1dc3364774 | Consolidate seqlen info into a struct | 1 month ago |
| Tri Dao | 586ba914bb | Move fwd tile size to a separate file | 1 month ago |
| Tri Dao | f6fd36f4b9 | Move the check n_block_max<=n_block_min from fwd_kernel to mainloop | 1 month ago |
| Tri Dao | 5194d9b2e6 | Include torch/version.h for TORCH_VERSION* macros | 1 month ago |
| Tri Dao | 25dbfa6452 | Add heuristic for setting num_splits from FA2 | 1 month ago |
| Tri Dao | 018b9af683 | Move .cu files to instantiations, use generate_kernels.py | 1 month ago |
| Tri Dao | 0c49ac9a07 | Implement rotary non-interleaved | 1 month ago |
| Tri Dao | b2d3fe92ff | Move rotary to a separate file | 1 month ago |
| Tri Dao | c5ba47b3d5 | Add fence.async to epilogue | 1 month ago |
| Tri Dao | 0290574956 | Put #if to avoid redefinition with torch >= 2.4 | 1 month ago |
| Tri Dao | 9f82a326ad | Implement rotary for attn decode | 1 month ago |
| Tri Dao | 5e628dcb5b | Don't need membar.cta | 1 month ago |
| Tri Dao | 4d00645c76 | Implement appending new KV to KV cache | 1 month ago |
| Tri Dao | 2e4eabd082 | Move barrier_O arrive from mainloop to epilogue to simplify | 1 month ago |
| Tri Dao | 82c1aa3514 | Move PackGQA epilogue code to pack_gqa.h | 1 month ago |
| Tri Dao | f5bd27d778 | Move PackGQA load_Q to a separate file | 1 month ago |
| Tri Dao | 7fccac78ce | Fix Mask | 1 month ago |
| Tri Dao | 4860b1068f | Fix mha_combine tests | 1 month ago |
| Tri Dao | d00b88ee05 | Move PagedKV to a separate file | 1 month ago |
| Tri Dao | a65af55f4a | Move mask_fn and load_Q into separate functions | 1 month ago |
| Tri Dao | 01de45b8f8 | Add combine kernel | 1 month ago |
| Tri Dao | df96486c31 | Decode: varlen, paged KV, leftpad | 1 month ago |