| Name | Last commit | Commit message | Age |
|------|-------------|-----------------|-----|
| layers | e45a46a5b7 | [Rotary] Implement GPT-J style (interleaved) rotary | 2 years ago |
| losses | c6ecd40a59 | Tweak CrossEntropyLoss to take process_group in init | 2 years ago |
| models | 78b7a1dc18 | [OPT] Load fp16 weights on CPU before moving to GPU | 2 years ago |
| modules | 88173a1aaf | [FusedDense] Support relu, rename FusedDenseGeluDense -> FusedMLP | 2 years ago |
| ops | eb33e587e9 | [LayerNorm] Rename x1 -> residual | 2 years ago |
| utils | 78b7a1dc18 | [OPT] Load fp16 weights on CPU before moving to GPU | 2 years ago |
| __init__.py | af4a9ce024 | Add missing __init__.py | 2 years ago |
| bert_padding.py | 4e38df059e | remove numpy dependency | 2 years ago |
| flash_attention.py | 41cb909741 | Change default dropout value in documentation | 2 years ago |
| flash_attn_interface.py | 88c4e5dbf6 | Fix the case when dout is not contiguous | 2 years ago |
| flash_attn_triton.py | 6b5f271c6d | [Triton] Avoid einops repeat by using Tensor.expand | 2 years ago |
| flash_attn_triton_og.py | b0c0db81f6 | Implement FlashAttention in Triton | 2 years ago |
| flash_blocksparse_attention.py | 5a61cb7729 | Rename src -> flash_attn | 2 years ago |
| flash_blocksparse_attn_interface.py | 5a61cb7729 | Rename src -> flash_attn | 2 years ago |
| fused_softmax.py | ed553e9238 | Add Megatron attention implementation for benchmarking | 2 years ago |