Commit History

Author   SHA1        Message                                                              Date
Tri Dao  94657af3e8  Add option for not doing intra-WG overlapping of gemm and softmax    2 months ago
Tri Dao  a4d41d2605  Fix epilogue compilation                                             2 months ago
Tri Dao  f0b5a6ec4c  Wait for barrier_O at load_tail to avoid Cluster error               2 months ago
Tri Dao  95ba9e51e5  Simplify epilogue when split by using thread_mma.partition_C         2 months ago
Tri Dao  47d4d2a76d  Fix FP8 hdim 256 perf regression                                     2 months ago
Tri Dao  fc2fd95a18  Renable FP8 kernels                                                  2 months ago
Tri Dao  e7b93e3902  Clean up mha_combine kernel                                          2 months ago
Tri Dao  5f8297808d  Don't need to write zero to output if Split                          2 months ago
Tri Dao  3d0d147940  Early stop on actual num_splits in mha_combine kernel                2 months ago
Tri Dao  9fd6b977bb  Precompute the pointers in mha_combine kernel                        2 months ago
Tri Dao  64d92bce53  Split PagedKV into separate .cu files to speed up compilation        2 months ago
Tri Dao  fe412d6b36  Redo rotary when contiguous                                          2 months ago
Tri Dao  bc8a001d8d  Load cos/sin by splitting the work among threads on the same row     2 months ago
Tri Dao  1dc3364774  Consolidate seqlen info into a struct                                2 months ago
Tri Dao  586ba914bb  Move fwd tile size to a separate file                                2 months ago
Tri Dao  f6fd36f4b9  Move the check n_block_max<=n_block_min from fwd_kernel to mainloop  2 months ago
Tri Dao  5194d9b2e6  Include torch/version.h for TORCH_VERSION* macros                    2 months ago
Tri Dao  25dbfa6452  Add heuristic for setting num_splits from FA2                        2 months ago
Tri Dao  018b9af683  Move .cu files to instantiations, use generate_kernels.py            2 months ago
Tri Dao  0c49ac9a07  Implement rotary non-interleaved                                     2 months ago
Tri Dao  b2d3fe92ff  Move rotary to a separate file                                       2 months ago
Tri Dao  c5ba47b3d5  Add fence.async to epilogue                                          2 months ago
Tri Dao  0290574956  Put #if to avoid redefinition with torch >= 2.4                      2 months ago
Tri Dao  9f82a326ad  Implement rotary for attn decode                                     2 months ago
Tri Dao  5e628dcb5b  Don't need membar.cta                                                2 months ago
Tri Dao  4d00645c76  Implement appending new KV to KV cache                               2 months ago
Tri Dao  2e4eabd082  Move barrier_O arrive from mainloop to epilogue to simplify          2 months ago
Tri Dao  82c1aa3514  Move PackGQA epilogue code to pack_gqa.h                             2 months ago
Tri Dao  f5bd27d778  Move PackGQA load_Q to a separate file                               2 months ago
Tri Dao  7fccac78ce  Fix Mask                                                             2 months ago