Commit History

Author SHA1 Message Date
  AlpinDale 34b41e0a87 chore: add coordinator to reduce code duplication in tp and pp 7 months ago
  AlpinDale d0cca80b8b feat: support sharded tensorizer models 7 months ago
  AlpinDale 4d1e613804 chore: minor simplifications 7 months ago
  AlpinDale 6cecbbff6a fix: reduce memory footprint of cuda graph by adding output buffer 7 months ago
  AlpinDale c975bba905 fix: sharded state loader with lora 7 months ago
  AlpinDale e321d80e4e fix: `prompt_logprobs==0` case 7 months ago
  AlpinDale 8d77c69cbd feat: support image processor and add llava example 7 months ago
  AlpinDale 08f639b8aa remove duplicate seq_lens_tensor 7 months ago
  AlpinDale f40b809d3b allow using v2 block manager with sliding window 7 months ago
  AlpinDale 5b0c11d190 support pipeline parallel pynccl groups 8 months ago
  AlpinDale de62ceb18c refactor: eliminate parallel worker per-step task scheduling overhead 8 months ago
  AlpinDale 656459fd84 make fp8_e4m3 work on nvidia 8 months ago
  AlpinDale 0aaf2dfc6b improve parallel logging 8 months ago
  AlpinDale 9e73559eba make use of batched rotary embedding kernels to support long context lora 8 months ago
  AlpinDale eaa06fdd14 fix some f-strings 8 months ago
  AlpinDale c58589318f remove the graph mode func 8 months ago
  AlpinDale 072b30fb42 measure end time within the cuda memory profiler 8 months ago
  AlpinDale 7bcff4ac03 implement sharded state dict 8 months ago
  AlpinDale a94de94c44 refactor: combine the prefill and decode into a single API (#553) 8 months ago
  AlpinDale 01190e5049 use flash attention for the decoding phase 8 months ago
  AlpinDale 50b7c13db0 refactor: attention selector (#552) 8 months ago
  AlpinDale b984fe4a91 refactor custom allreduce to support multiple tp groups 8 months ago
  AlpinDale be8154a8a0 feat: proper embeddings API with e5-mistral-7b support 8 months ago
  AlpinDale 8ae2cce237 refactor pynccl 8 months ago
  AlpinDale 0e062e66d3 set block size at init 8 months ago
  AlpinDale b55381df0e speedup lora loading times by resuing the cpu dummy lora 8 months ago
  AlpinDale 3a0d1c7705 add get_name method to attention backends 8 months ago
  AlpinDale 2351a0e2cd feat: FlashInfer backend for decoding phase (#548) 8 months ago
  AlpinDale 35ae01d7ba refactor: attention metadata term 8 months ago
  AlpinDale aed64884c6 feat: prompt logprobs with chunked prefill (#539) 8 months ago