This CUDA extension wraps the single-query attention kernel from FasterTransformer v5.2.1 for benchmarking purposes.
To install:

```sh
cd csrc/ft_attention && pip install .
```
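For reference, the wrapped kernel computes single-query (decoding-step) attention: one query token attends to a cached key/value sequence. A minimal NumPy sketch of that computation, for checking shapes and semantics only — the function name and argument layout here are illustrative, not the extension's actual API:

```python
import numpy as np

def single_query_attention(q, k_cache, v_cache):
    """Reference single-query attention (illustrative, not the extension's API).

    q:       (n_heads, head_dim)          -- query for the current token
    k_cache: (n_heads, seq_len, head_dim) -- cached keys
    v_cache: (n_heads, seq_len, head_dim) -- cached values
    Returns: (n_heads, head_dim)
    """
    scale = 1.0 / np.sqrt(q.shape[-1])
    # Attention scores for each head against every cached position.
    scores = np.einsum("hd,hsd->hs", q, k_cache) * scale
    # Numerically stable softmax over the sequence dimension.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted sum of cached values.
    return np.einsum("hs,hsd->hd", weights, v_cache)

rng = np.random.default_rng(0)
n_heads, seq_len, head_dim = 4, 16, 8
out = single_query_attention(rng.normal(size=(n_heads, head_dim)),
                             rng.normal(size=(n_heads, seq_len, head_dim)),
                             rng.normal(size=(n_heads, seq_len, head_dim)))
print(out.shape)  # (4, 8)
```

The CUDA kernel fuses these steps and reads the KV cache directly, which is what makes it a useful benchmarking baseline for the decoding path.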