Commit History

| Author | SHA1 | Message | Date |
|---|---|---|---|
| Tri Dao | 9a11f440d3 | Bump to v2.5.8 | 8 months ago |
| Tri Dao | 35060e7450 | [CI] Compile for pytorch 2.2.2 and 2.3.0 | 8 months ago |
| Tri Dao | ec6d22143b | [CrossEntropy] Change ignored_index -> ignore_index | 8 months ago |
| Tri Dao | 85881f547f | Bump to v2.5.7 | 9 months ago |
| Tri Dao | 2aea958f89 | [CI] Compile with torch 2.3.0.dev20240207 | 9 months ago |
| Tri Dao | 656daef4ea | Use Cute's local_tile to get gQ, gK, gV | 9 months ago |
| Tri Dao | 9eb3d099c1 | Transpose out when swapping seqlen_q and num_groups | 9 months ago |
| Ivan Komarov | f692b98d80 | Fix spurious re-compilations of `rotary_kernel` (#911) | 9 months ago |
| Driss Guessous | 23e8fa5a26 | Add the option for the macro and note (#893) | 9 months ago |
| ljss | 3e9414f1c3 | Minor fix in compute_attn_1rowblock_splitkv (#900) | 9 months ago |
| Tri Dao | 36587c01cb | [LayerNorm] Update layer_norm_linear | 10 months ago |
| Markus Krimmel | 6bbc532388 | fix: cast the alibi slopes to torch.float32 (#846) | 10 months ago |
| Driss Guessous | 4a73e903da | Add in macros for defining `__grid_constant__` (#852) | 10 months ago |
| Grigory Sizov | 2a15840f09 | Enable paged attention in varlen forward (#831) | 10 months ago |
| Arvind Sundararajan | 26c9e82743 | Support ARM builds (#757) | 10 months ago |
| Chirag Jain | 50896ec574 | Make nvcc threads configurable via environment variable (#885) | 10 months ago |
| Tri Dao | 6c9e60de56 | Bump to v2.5.6 | 10 months ago |
| Tri Dao | 6e2fa30797 | [CI] Change torch 2.3.0.dev20240126 to 20240105 for nvcr 24.02 | 10 months ago |
| Tri Dao | 87a1277653 | Bump to v2.5.5 | 11 months ago |
| Tri Dao | 2406f28805 | Enable headdim 256 backward on consumer GPUs (Ampere, Ada) | 11 months ago |
| Tri Dao | 43950dda45 | Bump to v2.5.4 | 11 months ago |
| Tri Dao | 4d6b794b3c | Update Cutlass to v3.4.1 | 11 months ago |
| Tri Dao | b32efb1a4d | Don't need to reduce row_sum during online softmax | 11 months ago |
| Qubitium | f45bbb4c94 | Optimize compilation to (1) avoid OOM, (2) minimize swap usage, and (3) avoid thread starvation, whether ninja decides how many workers to spawn or MAX_JOBS is guessed manually. The logic takes the minimum of the MAX_JOBS values auto-calculated from two metrics, CPU cores and free memory, so flash-attn compiles close to optimally in any consumer/server environment. (#832) | 11 months ago |
| Tri Dao | 5cdabc2809 | Bump to v2.5.3 | 11 months ago |
| Tri Dao | d9a5cb291c | Fix dv = torch::empty_like(k) for mha_bwd_varlen as well | 11 months ago |
| Tri Dao | a190df011c | Add window_size option to ParallelMHA | 11 months ago |
| Brian Hirsh | 2423cca3ad | fix backward for when query and key have different contiguity (#818) | 11 months ago |
| Grigory Sizov | 4687936413 | Fix Windows build (#816) | 11 months ago |
| Tri Dao | 61a7772479 | Bump to v2.5.2 | 11 months ago |