david/flash-attention

Author	SHA1 Message	Date
Tri Dao	fb9c9cbbe9 Support qkv_descale of shape (batch_size, nheads_kv)	2 days ago
Tri Dao	6293008748 Add option for Mma0_is_RS and Mma1_is_RS in attn fwd	6 days ago
Tri Dao	ae3c1fb3e0 Simplify bwd by setting NumdQWarpGroups = NumMmaWarpGroups	6 days ago
Tri Dao	2c996ca25f Use SeqlenInfo for bwd and epilogue	1 week ago
Tri Dao	88fdffc16e Fix test for softcap FP8	1 week ago
Tri Dao	f5e89ff136 Tune tile size for bwd softcap	1 week ago
Tri Dao	3b6ac2b954 Use compile time constants in local mask	1 week ago
Tri Dao	bfbaafd043 Fix bwd reading out of out LSE	1 week ago
Tri Dao	29cdfedd80 Use Bulk reduce instead of TMA for dQaccum, split across WGs	1 week ago
Tri Dao	9c954f7021 Use num_split_heuristics in fwd and fwd_varlen	1 week ago
Tri Dao	314b9edfc0 Don't need to link to cuda lib anymore	1 week ago
Tri Dao	f6e165becf Change tile_size and local to avoid wgmma being serialized	1 week ago
Tri Dao	f11624b746 Disable --split-compile due to ptxas register allocation failure	1 week ago
Tri Dao	e8a1edbeb2 Clean up some #include	1 week ago
Tri Dao	8ae77ea17c Download nvcc 12.3 to compile for best perf	1 week ago
Tri Dao	199c82052c Fix test for has_batch_idx	1 week ago
Tri Dao	3ed79742fb Add option to shuffle LSE and dPsum in the bwd	2 weeks ago
Tri Dao	42fc4962f0 Uncomment tanh softcapping	2 weeks ago
Tri Dao	6bc55b571c Use --split-compile to speed up compilation	2 weeks ago
Tri Dao	82dc825759 Don't use the unsafe convert_type function	2 weeks ago
Tri Dao	9553b2728f More env vars to disable features	2 weeks ago
Tri Dao	3248babb9e QOL: Use env var to selectively disable features	2 weeks ago
Tri Dao	c9c40eba83 Uncomment local attn	2 weeks ago
Tri Dao	94657af3e8 Add option for not doing intra-WG overlapping of gemm and softmax	2 weeks ago
Tri Dao	a4d41d2605 Fix epilogue compilation	2 weeks ago
Tri Dao	f0b5a6ec4c Wait for barrier_O at load_tail to avoid Cluster error	2 weeks ago
Tri Dao	95ba9e51e5 Simplify epilogue when split by using thread_mma.partition_C	3 weeks ago
Tri Dao	47d4d2a76d Fix FP8 hdim 256 perf regression	3 weeks ago
Tri Dao	fc2fd95a18 Renable FP8 kernels	3 weeks ago
Tri Dao	e7b93e3902 Clean up mha_combine kernel	3 weeks ago
