AlpinDale
|
6600c082bc
chore: pass bias to quant_method.apply
|
4 months ago |
AlpinDale
|
3a53ff1e01
fix: raise an error for no draft token case when draft_tp>1
|
4 months ago |
AlpinDale
|
9fd01e6358
fix: the metrics endpoint was not mounted
|
4 months ago |
AlpinDale
|
00503b9fc1
feat: non-uniform quantization via `compressed-tensors` for llama
|
4 months ago |
AlpinDale
|
2c653a2268
fix: make speculative decoding work with per-request seed
|
4 months ago |
AlpinDale
|
b7a2d52e47
fix: allow using mp executor for pipeline parallel
|
4 months ago |
AlpinDale
|
e90ad4acec
chore: implement fallback for fp8 channelwise using torch._scaled_mm
|
4 months ago |
AlpinDale
|
19340b672e
chore: improve min_capability checking for `compressed-tensors`
|
4 months ago |
AlpinDale
|
b6c4dfce23
chore: refactor TPU model runner and worker
|
4 months ago |
AlpinDale
|
8adc496a2a
fix: use paged attention for block swapping/copying in flashinfer
|
4 months ago |
AlpinDale
|
a26f784240
chore: use the LoRA tokenizer in OpenAI API (#599)
|
4 months ago |
AlpinDale
|
8ee8483fcf
`enable_gpu_advance_step` -> `allow_gpu_advance_step`
|
4 months ago |
AlpinDale
|
052a6e1eb6
feat: add SPMD worker execution using Ray accelerated DAG
|
4 months ago |
AlpinDale
|
65a97216a7
fix: avoid secondary error in ShmRingBuffer destructor
|
4 months ago |
AlpinDale
|
6671e3a162
feat: add CPU offloading support (#598)
|
4 months ago |
AlpinDale
|
fb4c01740c
feat: add asymmetric TP support for Qwen2
|
4 months ago |
AlpinDale
|
ee2c5d34da
feat: add fp8 channel-wise weight quantization support
|
4 months ago |
AlpinDale
|
6c4c20652b
feat: pipeline parallel support for mixtral
|
4 months ago |
AlpinDale
|
196e6b64f1
feat: add fp8 dynamic per-token quant kernel
|
4 months ago |
AlpinDale
|
5dbfc200f2
update all benchmarks (#597)
|
4 months ago |
AlpinDale
|
dd18c5042c
move prepare_inputs to the GPU (#596)
|
4 months ago |
AlpinDale
|
22305c91e9
refactor _prepare_model_input_tensor and attn metadata builder for most backends
|
4 months ago |
AlpinDale
|
e8af0d4a3b
fix: type annotation in worker
|
4 months ago |
AlpinDale
|
8c2dd39500
chore: remove multimodal stuff from TPU
|
4 months ago |
AlpinDale
|
6f8beb8583
fix: 4-node crash with PP
|
4 months ago |
AlpinDale
|
d638dc592d
fix: some minor typing issues in spec decode
|
4 months ago |
AlpinDale
|
0b2ae31122
cleanup rocm dockerfile
|
4 months ago |
AlpinDale
|
0429cb2229
fix: only create embeddings and lm_head when necessary for PP
|
4 months ago |
AlpinDale
|
2dfa4e47e6
chore: set seed for dummy weights init
|
4 months ago |
AlpinDale
|
f5d52320da
Port mamba kernels to Aphrodite (#595)
|
4 months ago |