Commit History

| Author | SHA1 | Message | Date |
|---|---|---|---|
| AlpinDale | 34b41e0a87 | chore: add coordinator to reduce code duplication in tp and pp | 7 months ago |
| AlpinDale | d0cca80b8b | feat: support sharded tensorizer models | 7 months ago |
| AlpinDale | 4d1e613804 | chore: minor simplifications | 7 months ago |
| AlpinDale | 6cecbbff6a | fix: reduce memory footprint of cuda graph by adding output buffer | 7 months ago |
| AlpinDale | c975bba905 | fix: sharded state loader with lora | 7 months ago |
| AlpinDale | e321d80e4e | fix: `prompt_logprobs==0` case | 7 months ago |
| AlpinDale | 8d77c69cbd | feat: support image processor and add llava example | 7 months ago |
| AlpinDale | 08f639b8aa | remove duplicate seq_lens_tensor | 7 months ago |
| AlpinDale | f40b809d3b | allow using v2 block manager with sliding window | 7 months ago |
| AlpinDale | 5b0c11d190 | support pipeline parallel pynccl groups | 7 months ago |
| AlpinDale | de62ceb18c | refactor: eliminate parallel worker per-step task scheduling overhead | 7 months ago |
| AlpinDale | 656459fd84 | make fp8_e4m3 work on nvidia | 7 months ago |
| AlpinDale | 0aaf2dfc6b | improve parallel logging | 7 months ago |
| AlpinDale | 9e73559eba | make use of batched rotary embedding kernels to support long context lora | 7 months ago |
| AlpinDale | eaa06fdd14 | fix some f-strings | 7 months ago |
| AlpinDale | c58589318f | remove the graph mode func | 7 months ago |
| AlpinDale | 072b30fb42 | measure end time within the cuda memory profiler | 7 months ago |
| AlpinDale | 7bcff4ac03 | implement sharded state dict | 7 months ago |
| AlpinDale | a94de94c44 | refactor: combine the prefill and decode into a single API (#553) | 7 months ago |
| AlpinDale | 01190e5049 | use flash attention for the decoding phase | 8 months ago |
| AlpinDale | 50b7c13db0 | refactor: attention selector (#552) | 8 months ago |
| AlpinDale | b984fe4a91 | refactor custom allreduce to support multiple tp groups | 8 months ago |
| AlpinDale | be8154a8a0 | feat: proper embeddings API with e5-mistral-7b support | 8 months ago |
| AlpinDale | 8ae2cce237 | refactor pynccl | 8 months ago |
| AlpinDale | 0e062e66d3 | set block size at init | 8 months ago |
| AlpinDale | b55381df0e | speedup lora loading times by reusing the cpu dummy lora | 8 months ago |
| AlpinDale | 3a0d1c7705 | add get_name method to attention backends | 8 months ago |
| AlpinDale | 2351a0e2cd | feat: FlashInfer backend for decoding phase (#548) | 8 months ago |
| AlpinDale | 35ae01d7ba | refactor: attention metadata term | 8 months ago |
| AlpinDale | aed64884c6 | feat: prompt logprobs with chunked prefill (#539) | 8 months ago |