david/flash-attention

Author	SHA1 Message	Date
Tri Dao	94657af3e8 Add option for not doing intra-WG overlapping of gemm and softmax	2 months ago
Tri Dao	a4d41d2605 Fix epilogue compilation	2 months ago
Tri Dao	f0b5a6ec4c Wait for barrier_O at load_tail to avoid Cluster error	2 months ago
Tri Dao	95ba9e51e5 Simplify epilogue when split by using thread_mma.partition_C	2 months ago
Tri Dao	47d4d2a76d Fix FP8 hdim 256 perf regression	2 months ago
Tri Dao	fc2fd95a18 Renable FP8 kernels	2 months ago
Tri Dao	e7b93e3902 Clean up mha_combine kernel	2 months ago
Tri Dao	5f8297808d Don't need to write zero to output if Split	2 months ago
Tri Dao	3d0d147940 Early stop on actual num_splits in mha_combine kernel	2 months ago
Tri Dao	9fd6b977bb Precompute the pointers in mha_combine kernel	2 months ago
Tri Dao	64d92bce53 Split PagedKV into separate .cu files to speed up compilation	2 months ago
Tri Dao	fe412d6b36 Redo rotary when contiguous	2 months ago
Tri Dao	bc8a001d8d Load cos/sin by splitting the work among threads on the same row	2 months ago
Tri Dao	1dc3364774 Consolidate seqlen info into a struct	2 months ago
Tri Dao	586ba914bb Move fwd tile size to a separate file	2 months ago
Tri Dao	f6fd36f4b9 Move the check n_block_max<=n_block_min from fwd_kernel to mainloop	2 months ago
Tri Dao	5194d9b2e6 Include torch/version.h for TORCH_VERSION* macros	2 months ago
Tri Dao	25dbfa6452 Add heuristic for setting num_splits from FA2	2 months ago
Tri Dao	018b9af683 Move .cu files to instantiations, use generate_kernels.py	2 months ago
Tri Dao	0c49ac9a07 Implement rotary non-interleaved	2 months ago
Tri Dao	b2d3fe92ff Move rotary to a separate file	2 months ago
Tri Dao	c5ba47b3d5 Add fence.async to epilogue	2 months ago
Tri Dao	0290574956 Put #if to avoid redefinition with torch >= 2.4	2 months ago
Tri Dao	9f82a326ad Implement rotary for attn decode	2 months ago
Tri Dao	5e628dcb5b Don't need membar.cta	2 months ago
Tri Dao	4d00645c76 Implement appending new KV to KV cache	2 months ago
Tri Dao	2e4eabd082 Move barrier_O arrive from mainloop to epilogue to simplify	2 months ago
Tri Dao	82c1aa3514 Move PackGQA epilogue code to pack_gqa.h	2 months ago
Tri Dao	f5bd27d778 Move PackGQA load_Q to a separate file	2 months ago
Tri Dao	7fccac78ce Fix Mask	2 months ago
