Tri Dao | 94657af3e8 | Add option for not doing intra-WG overlapping of gemm and softmax | 2 months ago
Tri Dao | a4d41d2605 | Fix epilogue compilation | 2 months ago
Tri Dao | f0b5a6ec4c | Wait for barrier_O at load_tail to avoid Cluster error | 2 months ago
Tri Dao | 95ba9e51e5 | Simplify epilogue when split by using thread_mma.partition_C | 2 months ago
Tri Dao | 47d4d2a76d | Fix FP8 hdim 256 perf regression | 2 months ago
Tri Dao | fc2fd95a18 | Re-enable FP8 kernels | 2 months ago
Tri Dao | e7b93e3902 | Clean up mha_combine kernel | 2 months ago
Tri Dao | 5f8297808d | Don't need to write zero to output if Split | 2 months ago
Tri Dao | 3d0d147940 | Early stop on actual num_splits in mha_combine kernel | 2 months ago
Tri Dao | 9fd6b977bb | Precompute the pointers in mha_combine kernel | 2 months ago
Tri Dao | 64d92bce53 | Split PagedKV into separate .cu files to speed up compilation | 2 months ago
Tri Dao | fe412d6b36 | Redo rotary when contiguous | 2 months ago
Tri Dao | bc8a001d8d | Load cos/sin by splitting the work among threads on the same row | 2 months ago
Tri Dao | 1dc3364774 | Consolidate seqlen info into a struct | 2 months ago
Tri Dao | 586ba914bb | Move fwd tile size to a separate file | 2 months ago
Tri Dao | f6fd36f4b9 | Move the check n_block_max<=n_block_min from fwd_kernel to mainloop | 2 months ago
Tri Dao | 5194d9b2e6 | Include torch/version.h for TORCH_VERSION* macros | 2 months ago
Tri Dao | 25dbfa6452 | Add heuristic for setting num_splits from FA2 | 2 months ago
Tri Dao | 018b9af683 | Move .cu files to instantiations, use generate_kernels.py | 2 months ago
Tri Dao | 0c49ac9a07 | Implement rotary non-interleaved | 2 months ago
Tri Dao | b2d3fe92ff | Move rotary to a separate file | 2 months ago
Tri Dao | c5ba47b3d5 | Add fence.async to epilogue | 2 months ago
Tri Dao | 0290574956 | Put #if to avoid redefinition with torch >= 2.4 | 2 months ago
Tri Dao | 9f82a326ad | Implement rotary for attn decode | 2 months ago
Tri Dao | 5e628dcb5b | Don't need membar.cta | 2 months ago
Tri Dao | 4d00645c76 | Implement appending new KV to KV cache | 2 months ago
Tri Dao | 2e4eabd082 | Move barrier_O arrive from mainloop to epilogue to simplify | 2 months ago
Tri Dao | 82c1aa3514 | Move PackGQA epilogue code to pack_gqa.h | 2 months ago
Tri Dao | f5bd27d778 | Move PackGQA load_Q to a separate file | 2 months ago
Tri Dao | 7fccac78ce | Fix Mask | 2 months ago