| Author | Commit | Message | Date |
|---|---|---|---|
| Tri Dao | 9a11f440d3 | Bump to v2.5.8 | 8 months ago |
| Tri Dao | 35060e7450 | [CI] Compile for pytorch 2.2.2 and 2.3.0 | 8 months ago |
| Tri Dao | ec6d22143b | [CrossEntropy] Change ignored_index -> ignore_index | 8 months ago |
| Tri Dao | 85881f547f | Bump to v2.5.7 | 9 months ago |
| Tri Dao | 2aea958f89 | [CI] Compile with torch 2.3.0.dev20240207 | 9 months ago |
| Tri Dao | 656daef4ea | Use Cute's local_tile to get gQ, gK, gV | 9 months ago |
| Tri Dao | 9eb3d099c1 | Transpose out when swapping seqlen_q and num_groups | 9 months ago |
| Ivan Komarov | f692b98d80 | Fix spurious re-compilations of `rotary_kernel` (#911) | 9 months ago |
| Driss Guessous | 23e8fa5a26 | Add the option for the macro and note (#893) | 9 months ago |
| ljss | 3e9414f1c3 | Minor fix in compute_attn_1rowblock_splitkv (#900) | 9 months ago |
| Tri Dao | 36587c01cb | [LayerNorm] Update layer_norm_linear | 10 months ago |
| Markus Krimmel | 6bbc532388 | fix: cast the alibi slopes to torch.float32 (#846) | 10 months ago |
| Driss Guessous | 4a73e903da | Add in macros for defining __grid_constant__ (#852) | 10 months ago |
| Grigory Sizov | 2a15840f09 | Enable paged attention in varlen forward (#831) | 10 months ago |
| Arvind Sundararajan | 26c9e82743 | Support ARM builds (#757) | 10 months ago |
| Chirag Jain | 50896ec574 | Make nvcc threads configurable via environment variable (#885) | 10 months ago |
| Tri Dao | 6c9e60de56 | Bump to v2.5.6 | 10 months ago |
| Tri Dao | 6e2fa30797 | [CI] Change torch 2.3.0.dev20240126 to 20240105 for nvcr 24.02 | 10 months ago |
| Tri Dao | 87a1277653 | Bump to v2.5.5 | 11 months ago |
| Tri Dao | 2406f28805 | Enable headdim 256 backward on consumer GPUs (Ampere, Ada) | 11 months ago |
| Tri Dao | 43950dda45 | Bump to v2.5.4 | 11 months ago |
| Tri Dao | 4d6b794b3c | Update Cutlass to v3.4.1 | 11 months ago |
| Tri Dao | b32efb1a4d | Don't need to reduce row_sum during online softmax | 11 months ago |
| Qubitium | f45bbb4c94 | Optimize compilation to avoid OOM, swap thrashing, and thread starvation: set MAX_JOBS to the minimum of two auto-calculated estimates, one from CPU cores and one from free memory, so flash-attn builds efficiently on both consumer and server machines (#832); see the sketch after this table | 11 months ago |
| Tri Dao | 5cdabc2809 | Bump to v2.5.3 | 11 months ago |
| Tri Dao | d9a5cb291c | Fix dv = torch::empty_like(k) for mha_bwd_varlen as well | 11 months ago |
| Tri Dao | a190df011c | Add window_size option to ParallelMHA | 11 months ago |
| Brian Hirsh | 2423cca3ad | fix backward for when query and key have different contiguity (#818) | 11 months ago |
| Grigory Sizov | 4687936413 | Fix Windows build (#816) | 11 months ago |
| Tri Dao | 61a7772479 | Bump to v2.5.2 | 11 months ago |
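
The #832 entry above describes a heuristic for picking a parallel-build job count: take the minimum of a CPU-core-based estimate and a free-memory-based estimate. Below is a minimal sketch of that idea, not the actual logic in flash-attn's setup.py; the `estimate_max_jobs` helper, the `psutil` dependency, and the 4 GiB-per-job budget are illustrative assumptions.

```python
# Sketch of the MAX_JOBS heuristic described in commit f45bbb4c94 (#832):
# bound the job count by both CPU cores and available memory. The
# 4 GiB-per-compile-job budget is an assumed figure, not the constant
# used in the flash-attn build scripts.
import os

import psutil  # assumed available; used only to read free memory


def estimate_max_jobs(mem_per_job_gib: float = 4.0) -> int:
    """Return a parallel-build job count limited by CPU cores and free memory."""
    cpu_jobs = os.cpu_count() or 1
    free_gib = psutil.virtual_memory().available / 2**30
    mem_jobs = max(1, int(free_gib // mem_per_job_gib))
    return max(1, min(cpu_jobs, mem_jobs))


if __name__ == "__main__":
    # A build script could export this as MAX_JOBS before invoking ninja.
    os.environ.setdefault("MAX_JOBS", str(estimate_max_jobs()))
    print("MAX_JOBS =", os.environ["MAX_JOBS"])
```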