Author    | Commit     | Message                                                                           | Date
AlpinDale | 2573b36f6a | feat: allow image embeddings for VLM input (#686)                                 | 4 months ago
AlpinDale | 300f889554 | chore: update flashinfer to v0.1.3 (#685)                                         | 4 months ago
AlpinDale | 4ca9aaaf3c | build: add empty device (#684)                                                    | 4 months ago
AlpinDale | b03fa02397 | refactor: base worker input refactor for multi-step (#683)                        | 4 months ago
AlpinDale | 8cfbe62a7c | chore: bump lmfe to v0.10.6 and include triton for tpu and xpu dockerfiles (#682) | 4 months ago
AlpinDale | 06cd48ea5c | chore: use mark_dynamic to reduce TPU compile times (#681)                        | 4 months ago
AlpinDale | fa5553b20f | fix: phi3v batch inference with different aspect ratio images (#680)              | 4 months ago
AlpinDale | 79d603954e | fix: chunked prefill with v2 block manager (#679)                                 | 4 months ago
AlpinDale | 3bbb3f2086 | feat: add numpy implementation of `compute_slot_mapping` (#678)                   | 4 months ago
AlpinDale | df208ab4e9 | fix: fp8 checkpoints with fused linear modules (#677)                             | 4 months ago
AlpinDale | 81fa31bcaf | feat: embeddings support for batched OAI endpoint (#676)                          | 4 months ago
AlpinDale | c2bb886b2e | fix: reinit procedure in `ModelInputForGPUBuilder` (#675)                         | 4 months ago
AlpinDale | bf88c8567e | feat: mamba model support (#674)                                                  | 4 months ago
AlpinDale | 8583aefed7 | chore: mamba cache single buffer (#673)                                           | 4 months ago
AlpinDale | 19ad952dd4 | chore: better stream termination in async engine (#672)                           | 4 months ago
AlpinDale | 1394008421 | chore: decouple `should_modify_greedy_probs_inplace` (#671)                       | 4 months ago
AlpinDale | 2da6a3ec2b | feat: option to apply temperature scaling last (#670)                             | 4 months ago
AlpinDale | e3a53712f2 | fix: mlpspeculator with padded vocab (#669)                                       | 4 months ago
AlpinDale | e200775863 | feat: enable using fp8 kv and prefix caching with chunked prefill (#668)          | 4 months ago
AlpinDale | ef40c05cd3 | fix: minor adjustments to scheduler and block manager (#667)                      | 4 months ago
AlpinDale | 7df7b8ca53 | optimization: reduce end-to-end overhead from python obj allocation (#666)        | 4 months ago
AlpinDale | ea78357d70 | fix: deps with TPU dockerfile (#665)                                              | 4 months ago
AlpinDale | 62111fab17 | feat: allow serving encoder-decoder models in the API server (#664)               | 4 months ago
AlpinDale | 3f49a55f82 | feat: add INT8 W8A16 quant for TPU (#663)                                         | 4 months ago
AlpinDale | 5dd0145414 | chore: update the env.py script and the bug report template (#662)                | 4 months ago
AlpinDale | 1927ce2be4 | fix: `get_num_blocks_touched` logic (#661)                                        | 4 months ago
AlpinDale | ed9a6f97c1 | fix: kill api server when pinging dead engine (#660)                              | 4 months ago
AlpinDale | 6d54f7687d | fix: lora with pipeline parallel (#659)                                           | 4 months ago
AlpinDale | 3405782f24 | fix: max_num_batched_tokens should not be limited for lora (#658)                 | 4 months ago
AlpinDale | 67ee885293 | fix: flashinfer outputs (#657)                                                    | 4 months ago