Tri Dao | fc2fd95a18 | Re-enable FP8 kernels | 1 month ago
Tri Dao | e7b93e3902 | Clean up mha_combine kernel | 1 month ago
Tri Dao | 5f8297808d | Don't need to write zero to output if Split | 1 month ago
Tri Dao | 3d0d147940 | Early stop on actual num_splits in mha_combine kernel | 1 month ago
Tri Dao | 9fd6b977bb | Precompute the pointers in mha_combine kernel | 1 month ago
Tri Dao | 64d92bce53 | Split PagedKV into separate .cu files to speed up compilation | 1 month ago
Tri Dao | fe412d6b36 | Redo rotary when contiguous | 1 month ago
Tri Dao | bc8a001d8d | Load cos/sin by splitting the work among threads on the same row | 1 month ago
Tri Dao | 1dc3364774 | Consolidate seqlen info into a struct | 1 month ago
Tri Dao | 586ba914bb | Move fwd tile size to a separate file | 1 month ago
Tri Dao | f6fd36f4b9 | Move the check n_block_max<=n_block_min from fwd_kernel to mainloop | 1 month ago
Tri Dao | 5194d9b2e6 | Include torch/version.h for TORCH_VERSION* macros | 1 month ago
Tri Dao | 25dbfa6452 | Add heuristic for setting num_splits from FA2 | 1 month ago
Tri Dao | 018b9af683 | Move .cu files to instantiations, use generate_kernels.py | 1 month ago
Tri Dao | 0c49ac9a07 | Implement rotary non-interleaved | 1 month ago
Tri Dao | b2d3fe92ff | Move rotary to a separate file | 1 month ago
Tri Dao | c5ba47b3d5 | Add fence.async to epilogue | 1 month ago
Tri Dao | 0290574956 | Put #if to avoid redefinition with torch >= 2.4 | 1 month ago
Tri Dao | 9f82a326ad | Implement rotary for attn decode | 1 month ago
Tri Dao | 5e628dcb5b | Don't need membar.cta | 1 month ago
Tri Dao | 4d00645c76 | Implement appending new KV to KV cache | 1 month ago
Tri Dao | 2e4eabd082 | Move barrier_O arrive from mainloop to epilogue to simplify | 1 month ago
Tri Dao | 82c1aa3514 | Move PackGQA epilogue code to pack_gqa.h | 1 month ago
Tri Dao | f5bd27d778 | Move PackGQA load_Q to a separate file | 1 month ago
Tri Dao | 7fccac78ce | Fix Mask | 1 month ago
Tri Dao | 4860b1068f | Fix mha_combine tests | 1 month ago
Tri Dao | d00b88ee05 | Move PagedKV to a separate file | 1 month ago
Tri Dao | a65af55f4a | Move mask_fn and load_Q into separate functions | 1 month ago
Tri Dao | 01de45b8f8 | Add combine kernel | 1 month ago
Tri Dao | df96486c31 | Decode: varlen, paged KV, leftpad | 1 month ago