AlpinDale | 34b41e0a87 | chore: add coordinator to reduce code duplication in tp and pp | 7 months ago
AlpinDale | d0cca80b8b | feat: support sharded tensorizer models | 7 months ago
AlpinDale | 4d1e613804 | chore: minor simplifications | 7 months ago
AlpinDale | 6cecbbff6a | fix: reduce memory footprint of cuda graph by adding output buffer | 7 months ago
AlpinDale | c975bba905 | fix: sharded state loader with lora | 7 months ago
AlpinDale | e321d80e4e | fix: `prompt_logprobs==0` case | 7 months ago
AlpinDale | 8d77c69cbd | feat: support image processor and add llava example | 7 months ago
AlpinDale | 08f639b8aa | remove duplicate seq_lens_tensor | 7 months ago
AlpinDale | f40b809d3b | allow using v2 block manager with sliding window | 7 months ago
AlpinDale | 5b0c11d190 | support pipeline parallel pynccl groups | 7 months ago
AlpinDale | de62ceb18c | refactor: eliminate parallel worker per-step task scheduling overhead | 7 months ago
AlpinDale | 656459fd84 | make fp8_e4m3 work on nvidia | 7 months ago
AlpinDale | 0aaf2dfc6b | improve parallel logging | 7 months ago
AlpinDale | 9e73559eba | make use of batched rotary embedding kernels to support long context lora | 7 months ago
AlpinDale | eaa06fdd14 | fix some f-strings | 7 months ago
AlpinDale | c58589318f | remove the graph mode func | 7 months ago
AlpinDale | 072b30fb42 | measure end time within the cuda memory profiler | 7 months ago
AlpinDale | 7bcff4ac03 | implement sharded state dict | 7 months ago
AlpinDale | a94de94c44 | refactor: combine the prefill and decode into a single API (#553) | 7 months ago
AlpinDale | 01190e5049 | use flash attention for the decoding phase | 8 months ago
AlpinDale | 50b7c13db0 | refactor: attention selector (#552) | 8 months ago
AlpinDale | b984fe4a91 | refactor custom allreduce to support multiple tp groups | 8 months ago
AlpinDale | be8154a8a0 | feat: proper embeddings API with e5-mistral-7b support | 8 months ago
AlpinDale | 8ae2cce237 | refactor pynccl | 8 months ago
AlpinDale | 0e062e66d3 | set block size at init | 8 months ago
AlpinDale | b55381df0e | speed up lora loading times by reusing the cpu dummy lora | 8 months ago
AlpinDale | 3a0d1c7705 | add get_name method to attention backends | 8 months ago
AlpinDale | 2351a0e2cd | feat: FlashInfer backend for decoding phase (#548) | 8 months ago
AlpinDale | 35ae01d7ba | refactor: attention metadata term | 8 months ago
AlpinDale | aed64884c6 | feat: prompt logprobs with chunked prefill (#539) | 8 months ago