93cffaf446  add flash_attn back (AlpinDale, 7 months ago)
f970f3f3fb  add base class for VLMs (AlpinDale, 7 months ago)
9e73559eba  make use of batched rotary embedding kernels to support long context lora (AlpinDale, 7 months ago)
1b86cf6164  navi21 fallback to naive attention (AlpinDale, 7 months ago)
0dc8492188  relax tiktoken version (AlpinDale, 7 months ago)
676322dd62  qwen2_moe: mlp_only_layers (AlpinDale, 7 months ago)
14a2d6f624  fix rope error when loading models with different dtypes (AlpinDale, 7 months ago)
0c15965621  fix fp8 kv (AlpinDale, 7 months ago)
2313c97e3d  add cutlass w8a8 kernels (#556) (AlpinDale, 7 months ago)
d4edba99f9  add lora dims for Qwen1.5-32B (AlpinDale, 7 months ago)
eaa06fdd14  fix some f-strings (AlpinDale, 7 months ago)
c58589318f  remove the graph mode func (AlpinDale, 7 months ago)
8e11259e90  missing triton autoconfig for rocm flash attn (AlpinDale, 7 months ago)
c66b1b57b1  Marlin 2:4 sparsity (#555) (AlpinDale, 7 months ago)
ad1c6b86a1  gptq_marlin: enable bfloat16 (AlpinDale, 7 months ago)
2ecfa98da9  re-fix mistral nemo (AlpinDale, 7 months ago)
9f3d6205ce  fix ray gpu executor (AlpinDale, 7 months ago)
236be273e5  feat: tensor parallel speculative decoding (#554) (AlpinDale, 7 months ago)
072b30fb42  measure end time within the cuda memory profiler (AlpinDale, 7 months ago)
7bcff4ac03  implement sharded state dict (AlpinDale, 7 months ago)
13e5ffd456  fix distributed_executor_backend in args (AlpinDale, 7 months ago)
a94de94c44  refactor: combine the prefill and decode into a single API (#553) (AlpinDale, 7 months ago)
fe431bb840  check for next port if current is unavailable (AlpinDale, 7 months ago)
033797fd55  refactor throughput benchmark script (AlpinDale, 7 months ago)
c6a501f682  add multiprocessing executor; make ray optional (AlpinDale, 7 months ago)
342346afda  improve hashing function (AlpinDale, 7 months ago)
d7c0dd5b50  fix: do not set the weight to fp8 for fp16 checkpoints (AlpinDale, 7 months ago)
01190e5049  use flash attention for the decoding phase (AlpinDale, 7 months ago)
e42d0b3455  possibly improve ngram efficiency (AlpinDale, 7 months ago)
0cea453d36  automatically detect tensorized models (AlpinDale, 7 months ago)