AlpinDale | cda0e93a10 | abstract away the platform for device capability | 7 months ago
AlpinDale | cf472315cc | refactor: isolate FP8 from mixtral | 7 months ago
AlpinDale | 7d79c0e726 | chore: use nvml query to avoid accidental cuda initialization | 7 months ago
AlpinDale | ddb3323f94 | refactor: have w8a8 compressed tensors use `process_weights_after_load` for fp8 | 7 months ago
AlpinDale | 272c64ab88 | chore: allow loading fp8 models with fused qkv/mlp | 7 months ago
AlpinDale | cd9ed8623b | fix: cuda version check for fp8 support in the cutlass kernels | 7 months ago
AlpinDale | 7e54c3916d | chore: factor out epilogues from cutlass kernels | 7 months ago
AlpinDale | 156f577f79 | feat: switch from `PYBIND11_MODULE` to `TORCH_LIBRARY` (#569) | 7 months ago
AlpinDale | 6cecbbff6a | fix: reduce memory footprint of cuda graph by adding output buffer | 7 months ago
AlpinDale | ab5ffb228c | fp8: `act_scale` -> `input_scale` | 7 months ago
AlpinDale | e9c0a248dc | fix: support check for fp8 cutlass | 7 months ago
AlpinDale | 40bc98b363 | chore: use cutlass kernels for fp8 if supported | 7 months ago
AlpinDale | 39b36efabf | fix: mixtral fp8 ckpt loading | 7 months ago
AlpinDale | 656459fd84 | make fp8_e4m3 work on nvidia | 8 months ago
AlpinDale | c4c153863e | improve fp8 linear layer performance | 8 months ago
AlpinDale | 7d23892501 | static and dynamic fp8 | 8 months ago
AlpinDale | 36660b55c2 | chore: mixtral fp8 w/ static scales (#542) | 8 months ago
AlpinDale | b178ae4b4a | chore: generalize linear_method to be quant_method (#540) | 8 months ago
AlpinDale | 46159b107a | formatting: pt1 | 8 months ago
AlpinDale | fca911ee0a | vLLM Upstream Sync (#526) | 8 months ago
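For context on commit 156f577f79, the move from `PYBIND11_MODULE` to `TORCH_LIBRARY` changes how a custom CUDA op is exposed to Python: instead of a plain pybind11 function, the op is registered with the PyTorch dispatcher. The sketch below shows the general shape of such a change; the namespace `example_ops`, the `rms_norm` op, and its schema are illustrative assumptions, not the repository's actual bindings.

```cpp
// Hedged sketch of a PYBIND11_MODULE -> TORCH_LIBRARY migration.
// All names here are hypothetical; they do not reflect the real bindings.
#include <torch/extension.h>
#include <torch/library.h>

// Illustrative op written in terms of ATen calls; a real extension would
// back this with a hand-written CUDA kernel.
torch::Tensor rms_norm(torch::Tensor input, torch::Tensor weight, double epsilon) {
  auto variance = input.pow(2).mean(-1, /*keepdim=*/true);
  return input * torch::rsqrt(variance + epsilon) * weight;
}

// Old style: a pybind11 module; the op is only a plain Python function.
// PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
//   m.def("rms_norm", &rms_norm, "Apply RMSNorm");
// }

// New style: declare the op schema with the dispatcher...
TORCH_LIBRARY(example_ops, m) {
  m.def("rms_norm(Tensor input, Tensor weight, float epsilon) -> Tensor");
}

// ...and register a backend-specific implementation for it.
TORCH_LIBRARY_IMPL(example_ops, CUDA, m) {
  m.impl("rms_norm", &rms_norm);
}
```

With the dispatcher registration, the op becomes reachable from Python as `torch.ops.example_ops.rms_norm(x, w, 1e-6)` and is visible to TorchScript and `torch.compile`, whereas the pybind11 binding was only an opaque Python callable.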