sclarkson 1feb711f46 Fix compilation with clang on ARM64 (#1285)		1 week ago
README.md	dfe29f5e2b [Gen] Don't use ft_attention, use flash_attn_with_kvcache instead	1 year ago
cuda_bf16_fallbacks.cuh	a01d1213d7 [Gen] Add kernel from FasterTransformer for benchmarking	1 year ago
cuda_bf16_wrapper.h	a01d1213d7 [Gen] Add kernel from FasterTransformer for benchmarking	1 year ago
decoder_masked_multihead_attention.cu	c3f2a632aa [ft_attention] Fix for seqlen=8136 (#488)	1 year ago
decoder_masked_multihead_attention.h	a157cc8c9b [FT] Implement MQA/GQA	1 year ago
decoder_masked_multihead_attention_template.hpp	a157cc8c9b [FT] Implement MQA/GQA	1 year ago
decoder_masked_multihead_attention_utils.h	3a9bfd076f [FT] rotary_cos/sin should have shape (dim) instead of (seqlen, dim)	1 year ago
ft_attention.cpp	1feb711f46 Fix compilation with clang on ARM64 (#1285)	1 week ago
setup.py	50896ec574 Make nvcc threads configurable via environment variable (#885)	9 months ago

Attention kernel from FasterTransformer

This CUDA extension wraps the single-query attention kernel from FasterTransformer v5.2.1 for benchmarking purposes.

cd csrc/ft_attention && pip install .

As of 2023-09-17, this extension is no longer used in the FlashAttention repo. FlashAttention now implements flash_attn_with_kvcache, which has all the features of this ft_attention kernel (and more).
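
For readers unfamiliar with the term, "single-query attention" here means one decoding step: a single new token's query attends over all cached keys and values, after the new key and value have been appended to the cache. The following is a minimal pure-Python sketch of that computation for one head, intended only as a reference for the semantics. It is not the FasterTransformer kernel or the flash_attn_with_kvcache implementation; the names decode_step and cache_len are hypothetical, and batching, multiple heads, MQA/GQA, and rotary embeddings are all omitted.

```python
import math


def decode_step(q, k_new, v_new, k_cache, v_cache, cache_len, softmax_scale=None):
    """Reference semantics of one decoding step for a single attention head.

    q, k_new, v_new: lists of floats of length head_dim (the new token).
    k_cache, v_cache: preallocated lists of rows (max_seqlen x head_dim);
        they are updated in place with k_new and v_new.
    cache_len: number of valid rows already stored in the cache.
    Returns (output vector of length head_dim, new cache length).

    Hypothetical helper for illustration only; not part of any library.
    """
    head_dim = len(q)
    if softmax_scale is None:
        softmax_scale = 1.0 / math.sqrt(head_dim)

    # Append the new key/value at the first free cache slot (in place).
    k_cache[cache_len] = list(k_new)
    v_cache[cache_len] = list(v_new)
    new_len = cache_len + 1

    # Scaled dot product of the single query with every valid cached key.
    scores = [
        softmax_scale * sum(qi * ki for qi, ki in zip(q, k_cache[t]))
        for t in range(new_len)
    ]

    # Numerically stable softmax: subtract the maximum before exponentiating.
    max_score = max(scores)
    weights = [math.exp(s - max_score) for s in scores]
    denom = sum(weights)

    # Output is the softmax-weighted average of the cached values.
    out = [
        sum(w * v_cache[t][d] for t, w in enumerate(weights)) / denom
        for d in range(head_dim)
    ]
    return out, new_len
```

Calling decode_step repeatedly, passing the returned length back in as cache_len, mimics token-by-token generation: each call grows the cache by one row and attends over everything stored so far.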
